Text Analytics

Sanjiv Ranjan Das

2016-12-11

Introduction

Reference monograph

Text expands the universe of data manyfold. See my monograph on text mining in finance at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf

The monograph covers some of the content of this presentation. The files at the following link are useful for the talk itself, and you may run the program code as we proceed.

http://srdas.github.io/Temp/user2016/

Text as Data

  1. Big Text: there is more textual data than numerical data.
  2. Text is versatile. It carries nuances and behavioral expressions that are not conveyed with numbers.
  3. Text contains emotive content. Sentiment analysis. Admati-Pfleiderer 2001; DeMarzo et al 2003; Antweiler-Frank 2004, 2005; Das-Chen 2007; Tetlock 2007; Tetlock et al 2008; Mitra et al 2008; Leinweber-Sisk 2010.
  4. Text contains opinions and connections. Das et al 2005; Das and Sisk 2005; Godes et al 2005; Li 2006; Hochberg et al 2007.
  5. Numbers aggregate; text disaggregates.

Anecdotal …

  1. In a talk at the 17th ACM Conference on Information and Knowledge Management (CIKM ’08), Google’s director of research Peter Norvig stated his unequivocal preference for data over algorithms: “data is more agile than code.” Yet, it is well understood that too much data can lead to overfitting so that an algorithm becomes mostly useless out-of-sample.
  2. Chris Anderson: “Data is the New Theory.”
  3. These issues are relevant to text mining, but let’s put them on hold till the end of the session.

Definition: Text-Mining

  1. Text mining is the large-scale, automated processing of plain text language in digital form to extract data that is converted into useful quantitative or qualitative information.
  2. Text mining is automated on big data that is not amenable to human processing within reasonable time frames. It entails extracting data that is converted into information of many types.
  3. Simple: Text mining may be as simple as keyword searches and counts.
  4. Complicated: It may require language parsing and complex rules for information extraction.
  5. Structured: Text mining may be applied to structured text, such as the information in forms and some kinds of web pages.
  6. Unstructured: Mining unstructured text is a much harder endeavor.
  7. Text mining is also aimed at unearthing unseen relationships in unstructured text as in meta analyses of research papers, see Van Noorden 2012.
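
The "simple" end of this spectrum, keyword searches and counts, can be sketched in a few lines of base R. The three sentences in `docs` are made up for illustration and are not taken from any data set used in this presentation.

```r
# Hypothetical example sentences (an assumption, for illustration only)
docs <- c("The stock rallied after strong earnings.",
          "Earnings missed; the stock fell sharply.",
          "No news today.")

# Keyword search: which documents mention "stock"? (case-insensitive)
has_stock <- grepl("stock", docs, ignore.case = TRUE)

# Keyword count: occurrences of "earnings" in each document
matches <- gregexpr("earnings", docs, ignore.case = TRUE)
n_earnings <- sapply(regmatches(docs, matches), length)

print(has_stock)
print(n_earnings)
```

Here `grepl()` answers the yes/no search question for each document, while `gregexpr()` with `regmatches()` collects every match so that the matches can be counted.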

Definition: News Analytics

Wikipedia defines it as - “… the measurement of the various qualitative and quantitative attributes of textual (unstructured data) news stories. Some of these attributes are: sentiment, relevance, and novelty. Expressing news stories as numbers permits the manipulation of everyday information in a mathematical and statistical way. News analytics are used in financial modeling, particularly in quantitative and algorithmic trading. Further, news analytics can be used to plot and characterize firm behaviors over time and thus yield important strategic insights about rival firms. News analytics are usually derived through automated text analysis and applied to digital texts using elements from natural language processing and machine learning such as latent semantic analysis, support vector machines, ‘bag of words’, among other techniques.”

https://www.amazon.com/Handbook-News-Analytics-Finance/dp/047066679X/ref=sr_1_1?ie=UTF8&qid=1466897817&sr=8-1&keywords=handbook+of+news+analytics

Data and Algorithms

Text Extraction

The R programming language is increasingly being used to download text from the web and then analyze it. The ease with which R may be used to scrape text from a web site may be seen from the following simple command in R:

text = readLines("http://srdas.github.io/bio-candid.html")
text[15:20]
## [1] "being an academic, he worked in the derivatives business in the"      
## [2] "Asia-Pacific region as a Vice-President at Citibank. His current"     
## [3] "research interests include: the modeling of default risk, machine"    
## [4] "learning, social networks, derivatives pricing models, portfolio"     
## [5] "theory, and venture capital. He has published over ninety articles in"
## [6] "academic journals, and has won numerous awards for research and"

Here, we downloaded my bio page from my university’s web site. It’s a simple HTML file.

length(text)
## [1] 79
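
Since a web page may move or change, the same `readLines()` workflow can also be tried offline. The following is a minimal sketch that writes a few made-up HTML lines to a temporary file and reads them back; the contents of `page` and the name `text_local` are assumptions for illustration.

```r
# Made-up HTML lines standing in for a downloaded page (an assumption)
page <- c("<HTML>", "<BODY>", "A short line of text.", "</BODY>", "</HTML>")

# Write to a temporary file, then read it back just as we would read a URL
tmp <- tempfile(fileext = ".html")
writeLines(page, tmp)
text_local <- readLines(tmp)

print(length(text_local))   # number of lines read
print(text_local[3])        # a single line, selected by index
unlink(tmp)                 # remove the temporary file
```

As with the URL version, the result is a character vector with one element per line of the file.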

String Parsing

Suppose we just want the 17th line. We do:

text[17]
## [1] "research interests include: the modeling of default risk, machine"

And, to find out the character length of this line we use the function:

library(stringr)
## Warning: package 'stringr' was built under R version 3.2.5
str_length(text[17])
## [1] 65

We have first invoked the library stringr that contains many string handling functions. In fact, we may also get the length of each line in the text vector by applying the function str_length() to the entire text vector.

text_len = str_length(text)
print(text_len)
##  [1]  6 69  0 66 70 70 70 63 69 65 68 67 64 67 63 64 65 64 69 63 68 70 39
## [24]  0  0 56  0 65 67 66 65 64 66 69 63 69 65 27  0  3  0 71 71 69 68 71
## [47] 12  0  3  0 71 70 68 71 69 63 67 69 64 67  7  0  3  0 67 71 65 63 72
## [70] 69 68 66 69 70 70 43  0  0  0
print(text_len[55])
## [1] 69
text_len[17]
## [1] 65
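
For readers without stringr installed, base R's `nchar()` is vectorized in the same way. The sketch below uses a few made-up lines (an assumption, not the bio page) and also shows how the lengths may be used to drop empty lines.

```r
# Made-up lines standing in for the downloaded page (an assumption)
lines <- c("<HTML>", "", "research interests include: machine learning", "<p>")

# nchar() is base R's vectorized character count
len <- nchar(lines)
print(len)

# Use the lengths to keep only the non-empty lines
nonempty <- lines[len > 0]
print(length(nonempty))
```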

Sort by Length

Some lines are very long and are the ones we are mainly interested in as they contain the bulk of the story, whereas many of the remaining, shorter lines contain html formatting instructions. Thus, we sort the lines in decreasing order of length with the following set of commands, so that the longest lines, such as the top three, appear first.

res = sort(text_len,decreasing=TRUE,index.return=TRUE)
idx = res$ix
text2 = text[idx]
text2
##  [1] "important to open the academic door to the ivory tower and let the world"
##  [2] "Sanjiv is now a Professor of Finance at Santa Clara University. He came" 
##  [3] "to SCU from Harvard Business School and spent a year at UC Berkeley. In" 
##  [4] "previous lives into his current existence, which is incredibly confused" 
##  [5] "Sanjiv's research style is instilled with a distinct \"New York state of"
##  [6] "funds, the internet, portfolio choice, banking models, credit risk, and" 
##  [7] "ocean.  The many walks in Greenwich village convinced him that there is" 
##  [8] "Santa Clara University's Leavey School of Business. He previously held"  
##  [9] "faculty appointments as Associate Professor at Harvard Business School"  
## [10] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"  
## [11] "published in May 2010.  He currently also serves as a Senior Fellow at"  
## [12] "mind\" - it is chaotic, diverse, with minimal method to the madness. He" 
## [13] "any time you like, but you can never leave.\" Which is why he is doomed" 
## [14] "to a lifetime in Hotel California. And he believes that, if this is as"  
## [15] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">" 
## [16] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"   
## [17] "theory, and venture capital. He has published over ninety articles in"   
## [18] "science fiction movies, and writing cool software code. When there is"   
## [19] "academic papers, which helps him relax. Always the contrarian, Sanjiv"   
## [20] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"   
## [21] "has unpublished articles in many other areas. Some years ago, he took"   
## [22] "There he learnt about the fascinating field of Randomized Algorithms,"   
## [23] "in. Academia is a real challenge, given that he has to reconcile many"   
## [24] "explains, you never really finish your education - \"you can check out"  
## [25] "College), and is also a qualified Cost and Works Accountant. He is a"    
## [26] "teaching. His recent book \"Derivatives: Principles and Practice\" was"  
## [27] "the Asia-Pacific region. He takes great pleasure in merging his many"    
## [28] "has published articles on derivatives, term-structure models, mutual"    
## [29] "more opinions than ideas. He has been known to have turned down many"    
## [30] "senior editor of The Journal of Investment Management, co-editor of"     
## [31] "Research, and Associate Editor of other academic journals. Prior to"     
## [32] "growing up, Sanjiv moved to New York to change the world, hopefully"     
## [33] "confirming that an unchecked hobby can quickly become an obsession."     
## [34] "pursuits, many of which stem from being in the epicenter of Silicon"     
## [35] "Coastal living did a lot to mold Sanjiv, who needs to live near the"     
## [36] "Sanjiv Das is the William and Janice Terry Professor of Finance at"      
## [37] "through research.  He graduated in 1994 with a Ph.D. from NYU, and"      
## [38] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"      
## [39] "offers from Mad magazine to publish his academic work. As he often"      
## [40] "B.Com in Accounting and Economics (University of Bombay, Sydenham"       
## [41] "research interests include: the modeling of default risk, machine"       
## [42] "After loafing and working in many parts of Asia, but never really"       
## [43] "since then spent five years in Boston, and now lives in San Jose,"       
## [44] "thinks that New York City is the most calming place in the world,"       
## [45] "no such thing as a representative investor, yet added many unique"       
## [46] "The Journal of Derivatives and The Journal of Financial Services"        
## [47] "Asia-Pacific region as a Vice-President at Citibank. His current"        
## [48] "learning, social networks, derivatives pricing models, portfolio"        
## [49] "California.  Sanjiv loves animals, places in the world where the"        
## [50] "skills he now applies earnestly to his editorial work, and other"        
## [51] "Ph.D. from New York University), Computer Science (M.S. from UC"         
## [52] "being an academic, he worked in the derivatives business in the"         
## [53] "academic journals, and has won numerous awards for research and"         
## [54] "time available from the excitement of daily life, Sanjiv writes"         
## [55] "time off to get another degree in computer science at Berkeley,"         
## [56] "features to his personal utility function. He learnt that it is"         
## [57] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"                
## [58] "bad as it gets, life is really pretty good."                             
## [59] "the FDIC Center for Financial Research."                                 
## [60] "after California of course."                                             
## [61] "and diverse."                                                            
## [62] "Valley."                                                                 
## [63] "<HTML>"                                                                  
## [64] "<p>"                                                                     
## [65] "<p>"                                                                     
## [66] "<p>"                                                                     
## [67] ""                                                                        
## [68] ""                                                                        
## [69] ""                                                                        
## [70] ""                                                                        
## [71] ""                                                                        
## [72] ""                                                                        
## [73] ""                                                                        
## [74] ""                                                                        
## [75] ""                                                                        
## [76] ""                                                                        
## [77] ""                                                                        
## [78] ""                                                                        
## [79] ""
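
To keep only the three longest lines rather than the full sorted vector, one may index with the first three sort positions. This is a small sketch on made-up lines (an assumption); it uses base R's `order()`, which returns the same indices as `res$ix` above.

```r
# Made-up lines of varying length (an assumption, not the bio page)
lines <- c("<p>", "a fairly long line of body text", "", "short text",
           "the longest line of body text in this toy example")

# order() returns the indices that sort the lengths, longest first
idx <- order(nchar(lines), decreasing = TRUE)

# Keep the three longest lines
top3 <- lines[head(idx, 3)]
print(top3)
```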

Text cleanup

In short, text extraction can be exceedingly simple, though getting clean text is not as easy an operation. Removing html tags and other unnecessary elements in the file is nevertheless a fairly simple operation. We undertake the following steps that use regular expressions (through str_replace_all() from stringr) to eliminate html formatting characters.

This will generate one single paragraph of text, relatively clean of formatting characters. Such a text collection is also known as a “bag of words”.

text = paste(text,collapse="\n")
print(text)
## [1] "<HTML>\n<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">\n\nSanjiv Das is the William and Janice Terry Professor of Finance at\nSanta Clara University's Leavey School of Business. He previously held\nfaculty appointments as Associate Professor at Harvard Business School\nand UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and\nPh.D. from New York University), Computer Science (M.S. from UC\nBerkeley), an MBA from the Indian Institute of Management, Ahmedabad,\nB.Com in Accounting and Economics (University of Bombay, Sydenham\nCollege), and is also a qualified Cost and Works Accountant. He is a\nsenior editor of The Journal of Investment Management, co-editor of\nThe Journal of Derivatives and The Journal of Financial Services\nResearch, and Associate Editor of other academic journals. Prior to\nbeing an academic, he worked in the derivatives business in the\nAsia-Pacific region as a Vice-President at Citibank. His current\nresearch interests include: the modeling of default risk, machine\nlearning, social networks, derivatives pricing models, portfolio\ntheory, and venture capital. He has published over ninety articles in\nacademic journals, and has won numerous awards for research and\nteaching. His recent book \"Derivatives: Principles and Practice\" was\npublished in May 2010.  He currently also serves as a Senior Fellow at\nthe FDIC Center for Financial Research.\n\n\n<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>\n\nAfter loafing and working in many parts of Asia, but never really\ngrowing up, Sanjiv moved to New York to change the world, hopefully\nthrough research.  He graduated in 1994 with a Ph.D. from NYU, and\nsince then spent five years in Boston, and now lives in San Jose,\nCalifornia.  Sanjiv loves animals, places in the world where the\nmountains meet the sea, riding sport motorbikes, reading, gadgets,\nscience fiction movies, and writing cool software code. 
When there is\ntime available from the excitement of daily life, Sanjiv writes\nacademic papers, which helps him relax. Always the contrarian, Sanjiv\nthinks that New York City is the most calming place in the world,\nafter California of course.\n\n<p>\n\nSanjiv is now a Professor of Finance at Santa Clara University. He came\nto SCU from Harvard Business School and spent a year at UC Berkeley. In\nhis past life in the unreal world, Sanjiv worked at Citibank, N.A. in\nthe Asia-Pacific region. He takes great pleasure in merging his many\nprevious lives into his current existence, which is incredibly confused\nand diverse.\n\n<p>\n\nSanjiv's research style is instilled with a distinct \"New York state of\nmind\" - it is chaotic, diverse, with minimal method to the madness. He\nhas published articles on derivatives, term-structure models, mutual\nfunds, the internet, portfolio choice, banking models, credit risk, and\nhas unpublished articles in many other areas. Some years ago, he took\ntime off to get another degree in computer science at Berkeley,\nconfirming that an unchecked hobby can quickly become an obsession.\nThere he learnt about the fascinating field of Randomized Algorithms,\nskills he now applies earnestly to his editorial work, and other\npursuits, many of which stem from being in the epicenter of Silicon\nValley.\n\n<p>\n\nCoastal living did a lot to mold Sanjiv, who needs to live near the\nocean.  The many walks in Greenwich village convinced him that there is\nno such thing as a representative investor, yet added many unique\nfeatures to his personal utility function. He learnt that it is\nimportant to open the academic door to the ivory tower and let the world\nin. Academia is a real challenge, given that he has to reconcile many\nmore opinions than ideas. He has been known to have turned down many\noffers from Mad magazine to publish his academic work. 
As he often\nexplains, you never really finish your education - \"you can check out\nany time you like, but you can never leave.\" Which is why he is doomed\nto a lifetime in Hotel California. And he believes that, if this is as\nbad as it gets, life is really pretty good.\n\n\n"
text = str_replace_all(text,"[<>{}()&;,.\n]"," ")
print(text)
## [1] " HTML   BODY background=\"http://algo scu edu/~sanjivdas/graphics/back2 gif\"   Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara University's Leavey School of Business  He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley  He holds post-graduate degrees in Finance  M Phil and Ph D  from New York University   Computer Science  M S  from UC Berkeley   an MBA from the Indian Institute of Management  Ahmedabad  B Com in Accounting and Economics  University of Bombay  Sydenham College   and is also a qualified Cost and Works Accountant  He is a senior editor of The Journal of Investment Management  co-editor of The Journal of Derivatives and The Journal of Financial Services Research  and Associate Editor of other academic journals  Prior to being an academic  he worked in the derivatives business in the Asia-Pacific region as a Vice-President at Citibank  His current research interests include: the modeling of default risk  machine learning  social networks  derivatives pricing models  portfolio theory  and venture capital  He has published over ninety articles in academic journals  and has won numerous awards for research and teaching  His recent book \"Derivatives: Principles and Practice\" was published in May 2010   He currently also serves as a Senior Fellow at the FDIC Center for Financial Research     p   B Sanjiv Das: A Short Academic Life History /B   p   After loafing and working in many parts of Asia  but never really growing up  Sanjiv moved to New York to change the world  hopefully through research   He graduated in 1994 with a Ph D  from NYU  and since then spent five years in Boston  and now lives in San Jose  California   Sanjiv loves animals  places in the world where the mountains meet the sea  riding sport motorbikes  reading  gadgets  science fiction movies  and writing cool software code  When there is time available from the excitement of daily life  Sanjiv 
writes academic papers  which helps him relax  Always the contrarian  Sanjiv thinks that New York City is the most calming place in the world  after California of course    p   Sanjiv is now a Professor of Finance at Santa Clara University  He came to SCU from Harvard Business School and spent a year at UC Berkeley  In his past life in the unreal world  Sanjiv worked at Citibank  N A  in the Asia-Pacific region  He takes great pleasure in merging his many previous lives into his current existence  which is incredibly confused and diverse    p   Sanjiv's research style is instilled with a distinct \"New York state of mind\" - it is chaotic  diverse  with minimal method to the madness  He has published articles on derivatives  term-structure models  mutual funds  the internet  portfolio choice  banking models  credit risk  and has unpublished articles in many other areas  Some years ago  he took time off to get another degree in computer science at Berkeley  confirming that an unchecked hobby can quickly become an obsession  There he learnt about the fascinating field of Randomized Algorithms  skills he now applies earnestly to his editorial work  and other pursuits  many of which stem from being in the epicenter of Silicon Valley    p   Coastal living did a lot to mold Sanjiv  who needs to live near the ocean   The many walks in Greenwich village convinced him that there is no such thing as a representative investor  yet added many unique features to his personal utility function  He learnt that it is important to open the academic door to the ivory tower and let the world in  Academia is a real challenge  given that he has to reconcile many more opinions than ideas  He has been known to have turned down many offers from Mad magazine to publish his academic work  As he often explains  you never really finish your education - \"you can check out any time you like  but you can never leave \" Which is why he is doomed to a lifetime in Hotel California  And he believes 
that  if this is as bad as it gets  life is really pretty good    "
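
Note that the replacement above removes the angle brackets but leaves the tag names (HTML, BODY, p, B) in the text. A slightly stricter variant, sketched here in base R on a made-up fragment (an assumption), drops whole tags first and then collapses the leftover whitespace. Regular expressions of this kind are adequate for simple pages only; they are not a general HTML parser.

```r
# Made-up HTML fragment (an assumption, not the bio page)
html <- "<p> <B>A Short History</B> <p>\nHe moved to New York,\nthen to San Jose."

# 1. Drop whole tags: "<", then any run of non-">" characters, then ">"
no_tags <- gsub("<[^>]+>", " ", html)

# 2. Replace newlines and selected punctuation with spaces
no_punct <- gsub("[{}()&;,.\n]", " ", no_tags)

# 3. Collapse repeated whitespace and trim both ends
clean <- trimws(gsub("\\s+", " ", no_punct))
print(clean)
```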

The XML Package

The XML package in R also comes with many functions that aid in cleaning up text and dropping it (mostly unformatted) into a flat file or data frame. This may then be further processed. Here is some example code for this.

Processing XML files in R into a data frame

The following example has been adapted from r-bloggers.com. It uses the following URL:

http://www.w3schools.com/xml/plant_catalog.xml

library(XML)
## Warning: package 'XML' was built under R version 3.2.4
#Part1: Reading an xml and creating a data frame with it.

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"
xmlfile <- xmlTreeParse(xml.url)
xmltop <- xmlRoot(xmlfile)
plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
plantcat_df <- data.frame(t(plantcat),row.names=NULL)
plantcat_df[1:5,1:4]
##                COMMON              BOTANICAL ZONE        LIGHT
## 1           Bloodroot Sanguinaria canadensis    4 Mostly Shady
## 2           Columbine   Aquilegia canadensis    3 Mostly Shady
## 3      Marsh Marigold       Caltha palustris    4 Mostly Sunny
## 4             Cowslip       Caltha palustris    4 Mostly Shady
## 5 Dutchman's-Breeches    Dicentra cucullaria    3 Mostly Shady
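
The same idea can be tried without a network connection by parsing XML held in a string. The sketch below assumes the XML package is installed and uses a made-up two-record catalog shaped like the plant catalog; `xmlToDataFrame()` is a convenience function in that package for flat, record-like XML of this kind.

```r
library(XML)

# Made-up catalog with the same shape as the plant catalog (an assumption)
xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")

# Parse from a string rather than from a URL
doc <- xmlParse(xml_text, asText = TRUE)

# One row per child of the root, one column per grandchild tag
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(df)
```

All values come back as character strings, so numeric columns such as ZONE would still need conversion with `as.numeric()`.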

Creating an XML file from a data frame

#Example adapted from https://stat.ethz.ch/pipermail/r-help/2008-September/175364.html
#Load the iris data set and create a data frame
data("iris")
data <- as.data.frame(iris)

xml <- xmlTree()
xml$addTag("document", close=FALSE)
## Warning in xmlRoot.XMLInternalDocument(currentNodes[[1]]): empty XML
## document
for (i in 1:nrow(data)) {
  xml$addTag("row", close=FALSE)
  for (j in names(data)) {
    xml$addTag(j, data[i, j])
  }
  xml$closeTag()
}
xml$closeTag()

#view the xml (uncomment line below to see XML, long output)
#cat(saveXML(xml))

The Response to News

Das, Martinez-Jerez, and Tufano (FM 2005)

Breakdown of News Flow

Frequency of Postings

Weekly Posting

Intraday Posting

Number of Characters per Posting

Text Handling

First, let’s read in a simple web page (my landing page):

text = readLines("http://srdas.github.io/")
print(text[1:4])
## [1] "<html>"                                          
## [2] ""                                                
## [3] "<head>"                                          
## [4] "<title>SCU Web Page of Sanjiv Ranjan Das</title>"
print(length(text))
## [1] 36

String Detection

String handling is a basic need, so we use the stringr package.

#EXTRACTING SUBSTRINGS (take some time to look at
#the "stringr" package also)
library(stringr)
substr(text[4],24,29)
## [1] "Sanjiv"
#IF YOU WANT TO LOCATE A STRING
res = regexpr("Sanjiv",text[4])
print(res)
## [1] 24
## attr(,"match.length")
## [1] 6
## attr(,"useBytes")
## [1] TRUE
print(substr(text[4],res[1],res[1]+nchar("Sanjiv")-1))
## [1] "Sanjiv"
#ANOTHER WAY
res = str_locate(text[4],"Sanjiv")
print(res)
##      start end
## [1,]    24  29
print(substr(text[4],res[1],res[2]))
## [1] "Sanjiv"
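
Both `regexpr()` and `str_locate()` report only the first match. When a string may contain the pattern more than once, base R's `gregexpr()` returns every starting position. The line below is made up for illustration (an assumption) so that the name appears twice.

```r
# Made-up line in which the name appears twice (an assumption)
line <- "<title>Sanjiv Das: Web Page of Sanjiv Ranjan Das</title>"

# gregexpr() returns the start position of every match, not just the first
pos <- gregexpr("Sanjiv", line)[[1]]
starts <- as.integer(pos)
ends <- starts + attr(pos, "match.length") - 1

# Extract each match by its start and end position
found <- substring(line, starts, ends)
print(starts)
print(found)
```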

Cleaning Text

Now we look at using regular expressions with the grep command to clean out text. I will read in my research page to process this. Here we are undertaking a “ruthless” cleanup.

#SIMPLE TEXT HANDLING
text = readLines("http://srdas.github.io/research.htm")
print(length(text))
## [1] 823
print(text)
##   [1] "<HTML>"                                                                                                                                                                                                                                                                                                          
##   [2] "<HEAD>"                                                                                                                                                                                                                                                                                                          
##   [3] "<TITLE>Research of Professor Sanjiv Ranjan Das</TITLE>"                                                                                                                                                                                                                                                          
##   [4] "<BASE HREF=\"http://srdas.github.io/\">"                                                                                                                                                                                                                                                                         
##   [5] "</HEAD>"                                                                                                                                                                                                                                                                                                         
##   [6] "<BODY background=\"http://srdas.github.io/graphics/back2.gif\">"                                                                                                                                                                                                                                                 
##   [7] ""                                                                                                                                                                                                                                                                                                                
##   [8] "<H2>BOOKS and MONOGRAPHS</H2>"                                                                                                                                                                                                                                                                                   
##   [9] ""                                                                                                                                                                                                                                                                                                                
##  [10] "<OL reversed>"                                                                                                                                                                                                                                                                                                   
##  [11] ""                                                                                                                                                                                                                                                                                                                
##  [12] "<LI><img src=\"graphics/DSTMAA.png\" width=\"50\" height=\"65\">"                                                                                                                                                                                                                                                
##  [13] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"                                                                                                                                                                                                                    
##  [14] "<a href=\"http://srdas.github.io/Papers/DSA_Book.pdf\">Read here.</a>"                                                                                                                                                                                                                                           
##  [15] ""                                                                                                                                                                                                                                                                                                                
##  [16] ""                                                                                                                                                                                                                                                                                                                
##  [17] "<LI><img src=\"graphics/derbook_cover.png\" width=\"50\" height=\"65\">"                                                                                                                                                                                                                                         
##  [18] "\"Derivatives: Principles and Practice\" (2010),"                                                                                                                                                                                                                                                                
##  [19] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                                                                                                                                                                                                              
##  [20] "<a href=\"http://www.amazon.com/Derivatives-Rangarajan-Sundaram/dp/0072949317/ref=sr_1_1?ie=UTF8&s=books&qid=1268798971&sr=8-1\">[Amazon]</a>"                                                                                                                                                                   
##  [21] "<a href=\"http://productsearch.barnesandnoble.com/search/results.aspx?WRD=sundaram+das\">[BarnesNoble]</a>"                                                                                                                                                                                                      
##  [22] ""                                                                                                                                                                                                                                                                                                                
##  [23] "</OL>"                                                                                                                                                                                                                                                                                                           
##  [24] ""                                                                                                                                                                                                                                                                                                                
##  [25] "<H2>REFEREED JOURNAL PUBLICATIONS</H2>"                                                                                                                                                                                                                                                                          
##  [26] ""                                                                                                                                                                                                                                                                                                                
##  [27] "<OL reversed>"                                                                                                                                                                                                                                                                                                   
##  [28] ""                                                                                                                                                                                                                                                                                                                
##  [29] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [30] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                                                                                                                                                                                                             
##  [31] "Forthcoming, <I>Journal of Banking and Finance</I>."                                                                                                                                                                                                                                                             
##  [32] "<br>[<I> [Develops a new measure of liquidity for all sectors of the markets using ETFs. "                                                                                                                                                                                                                       
##  [33] "RFinance Best Paper Award, May 2016. This paper won the S&P SPIVA 2012 Award for innovation of an index.</I>]"                                                                                                                                                                                                   
##  [34] "<a href=\"Papers/etfliq.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
##  [35] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [36] ""                                                                                                                                                                                                                                                                                                                
##  [37] "<LI><img src=\"graphics/JAI.png\" width=\"55\" height=\"40\">"                                                                                                                                                                                                                                                   
##  [38] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."                                                                                                                                                                                                                                                
##  [39] "<I>Journal of Alternative Investments</I>, Special Issue on Systemic Risk, v18(4), 33-51."                                                                                                                                                                                                                       
##  [40] "<br>[<I>A new approach to identifying system-wide financial risk, SIFIs, and several other measures"                                                                                                                                                                                                             
##  [41] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                                                                                                                                                                                                           
##  [42] "the best paper on SIFIs (systemically important financial institutions). "                                                                                                                                                                                                                                       
##  [43] "It also won the best paper award at "                                                                                                                                                                                                                                                                            
##  [44] "the R Finance conference, Chicago 2015. </I>]"                                                                                                                                                                                                                                                                   
##  [45] "<a href=\"Papers/JAI_Das_issue.pdf\">[PDF of paper]</a>"                                                                                                                                                                                                                                                         
##  [46] "<a href=\"Papers/JAI_EditorsLetter_issue.pdf\">[Editor's letter re Special Issue]</a>"                                                                                                                                                                                                                           
##  [47] "<a href=\"Papers/JAI_Getmansky_Stein_issue.pdf\">[Editor's overview]</a>"                                                                                                                                                                                                                                        
##  [48] "<a href=\"Papers/RiskNetworks_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "                                                                                                                                                                                                                             
##  [49] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [50] ""                                                                                                                                                                                                                                                                                                                
##  [51] ""                                                                                                                                                                                                                                                                                                                
##  [52] ""                                                                                                                                                                                                                                                                                                                
##  [53] ""                                                                                                                                                                                                                                                                                                                
##  [54] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [55] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "                                                                                                                                                                                                                                              
##  [56] "<I>Journal of Banking and Finance</I>, v50, 121-140."                                                                                                                                                                                                                                                            
##  [57] "<a href=\"Papers/DasKim_JBF2015_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                           
##  [58] "<br>[<I>Extends the Merton risky debt model from static debt to dynamic debt"                                                                                                                                                                                                                                    
##  [59] "and generates credit spread term structures that are closer to those in the data</I>]"                                                                                                                                                                                                                           
##  [60] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [61] ""                                                                                                                                                                                                                                                                                                                
##  [62] "<LI><img src=\"graphics/FTF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
##  [63] "\"Text and Context: Language Analytics for Finance\", (2014),"                                                                                                                                                                                                                                                   
##  [64] "<I>Foundations and Trends in Finance</I>, v8(3), 145-260. "                                                                                                                                                                                                                                                      
##  [65] "<a href=\"Papers/Das_TextAnalyticsInFinance.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                     
##  [66] "<br>[<I>A comprehensive survey of comcepts, tools, techniques, and empirical "                                                                                                                                                                                                                                   
##  [67] "literature on textual processing in finance.</I>]"                                                                                                                                                                                                                                                               
##  [68] ""                                                                                                                                                                                                                                                                                                                
##  [69] ""                                                                                                                                                                                                                                                                                                                
##  [70] "<LI><img src=\"graphics/jfe.gif\" width=\"40\" height=\"55\">\"Did CDS Trading Improve the Market for Corporate Bonds?\" (with Madhu Kalimipalli and Subhankar Nayak), (2014), <I>Journal of Financial Economics</I> 111, 495-525."                                                                              
##  [71] "<a href=\"Papers/cdsbondeff.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
##  [72] "<br>[<I>The inception of CDS trading in a reference name renders its bonds less efficient, with no improvement in market quality or liquidity</I>]"                                                                                                                                                              
##  [73] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [74] ""                                                                                                                                                                                                                                                                                                                
##  [75] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
##  [76] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""                                                                                                                                                                                                                                
##  [77] "(with Ray Meadows), (2013), <I>Journal of Banking and Finance</I> 37, 636-647. "                                                                                                                                                                                                                                 
##  [78] "<a href=\"Papers/sam.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
##  [79] "<br>[<I>A closed-form solution for mortgage debt with default and optimal loan modificatoin thereon.</I>]"                                                                                                                                                                                                       
##  [80] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [81] ""                                                                                                                                                                                                                                                                                                                
##  [82] ""                                                                                                                                                                                                                                                                                                                
##  [83] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
##  [84] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "                                                                                                                                                                                                                     
##  [85] "<I>Journal of Economic Dynamics and Control</I>, 37(1), 137-153."                                                                                                                                                                                                                                                
##  [86] "<a href=\"Papers/JEDC_FINAL_PROOF.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                               
##  [87] "<br>[<I>Explores the roles in behavioral portfolios of option collars, capital guaranteed notes, "                                                                                                                                                                                                               
##  [88] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                                                                                                                                                                                                  
##  [89] "</I>]"                                                                                                                                                                                                                                                                                                           
##  [90] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [91] ""                                                                                                                                                                                                                                                                                                                
##  [92] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                           
##  [93] "\"The Principal Principle,\" (2012), <I>Journal of Financial and QuantitativeAnalysis</I>, 47(6), 1215-1246.  "                                                                                                                                                                                                  
##  [94] "<a href=\"http://journals.cambridge.org/repo_A884JKBk\">[PDF]</a>"                                                                                                                                                                                                                                               
##  [95] "<br>[<I>Optimal approaches for mortgage loan modification. Principal reduction is optimal, and better than rate reductions, maturity extensions, and principal forebearance. Shared-appreciation mortgages solve moral hazard.</I>]"                                                                             
##  [96] "</LI>"                                                                                                                                                                                                                                                                                                           
##  [97] ""                                                                                                                                                                                                                                                                                                                
##  [98] "<LI><img src=\"graphics/IEEE.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                                 
##  [99] "\"Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study,\" (2011), (with Douglas Burdick, Mauricio A. Hernandez, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan), <I>IEEE Data Engineering Bulletin</I>, 34(3), 60-67."
## [100] "<a href=\"Papers/midaswww2011_FINAL.pdf\">[PDF older version]</a>"                                                                                                                                                                                                                                               
## [101] "<a href=\"Papers/midas-deb_July2011.pdf\">[PDF final version]</a>"                                                                                                                                                                                                                                               
## [102] ""                                                                                                                                                                                                                                                                                                                
## [103] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                          
## [104] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "                                                                                                                                                                                    
## [105] "<I>Journal of Financial Intermediation</I> 20(2), 199--230."                                                                                                                                                                                                                                                     
## [106] "<a href=\"Papers/synd.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [107] "<br>[<I>Syndicate-financed firms fare better---higher return multiples come from better selection, but time-to-exit and likelihood of exit are better on accont of superior monitoring by the syndicate.</I>]"                                                                                                   
## [108] "</LI>"                                                                                                                                                                                                                                                                                                           
## [109] ""                                                                                                                                                                                                                                                                                                                
## [110] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\"> \"Portfolio"                                                                                                                                                                                                                                
## [111] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"                                                                                                                                                                                                                                    
## [112] "Scheid, and Meir Statman),  <I>Journal of Financial and Quantitative"                                                                                                                                                                                                                                            
## [113] "Analysis</I>, v45(2), 311-334."                                                                                                                                                                                                                                                                                  
## [114] "<a href=\"http://journals.cambridge.org/repo_A772rEdS\">[PDF (copyright: Cambridge University Press)]</a>"                                                                                                                                                                                                       
## [115] "<br>[<I>Mean-variance optimization is reconciled with behavioral porfolio theory. Mental "                                                                                                                                                                                                                       
## [116] "accounts optimization leads to better aggregate portfolios.</I>]"                                                                                                                                                                                                                                                
## [117] "</LI>"                                                                                                                                                                                                                                                                                                           
## [118] ""                                                                                                                                                                                                                                                                                                                
## [119] "<LI><img src=\"graphics/jcr.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [120] "\"The Long and Short of it: Why are stocks with shorter run-lengths preferred?\" (2010), (with Priya Raghubir), <I>Journal of Consumer Research</I>. 36(6), 964-982."                                                                                                                                            
## [121] "<a href=\"Papers/runlength.pdf\">[PDF]</a>, "                                                                                                                                                                                                                                                                    
## [122] "<a href=\"Papers/runlength_summary.pdf\">[Non-technical summary]</a>"                                                                                                                                                                                                                                            
## [123] "<br>[<I>People responding to stock charts are systematically biased against stocks with longer run lengths, even if these stocks are no riskier than those with shorter runs.</I>]"                                                                                                                              
## [124] "</LI>"                                                                                                                                                                                                                                                                                                           
## [125] ""                                                                                                                                                                                                                                                                                                                
## [126] ""                                                                                                                                                                                                                                                                                                                
## [127] "<LI><img src=\"graphics/anor.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                  
## [128] "\"Run Lengths and Liquidity,\" (with Paul Hanouna), (2010), <I>Annals of Operations Resarch</I>, Special Issue on Risk and Uncertainty, 176(1), 127-152."                                                                                                                                                        
## [129] "<a href=\"Papers/rs.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                             
## [130] "<br>[<I>The run signature of a stock is shown to be mathematically related to liquidity. Runs are "                                                                                                                                                                                                              
## [131] "priced factors. </I>]"                                                                                                                                                                                                                                                                                           
## [132] "</LI>"                                                                                                                                                                                                                                                                                                           
## [133] ""                                                                                                                                                                                                                                                                                                                
## [134] ""                                                                                                                                                                                                                                                                                                                
## [135] ""                                                                                                                                                                                                                                                                                                                
## [136] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [137] "\"Implied Recovery,'' (with Paul Hanouna), (2009), <I>Journal of Economic Dynamics and Control</I>, 33(11), 1837-1857."                                                                                                                                                                                          
## [138] "<a href=\"Papers/imprec.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [139] "<br>[<I>How to use the term structure of CDS spreads to jointly identify the term structures of forward default probability and recovery rates.  </I>]"                                                                                                                                                          
## [140] "</LI>"                                                                                                                                                                                                                                                                                                           
## [141] ""                                                                                                                                                                                                                                                                                                                
## [142] ""                                                                                                                                                                                                                                                                                                                
## [143] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [144] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "                                                                                                                                                                                                                                
## [145] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                                                                                                                                                                                                  
## [146] "<I>Journal of Banking and Finance</I>, 33, 719-730.  "                                                                                                                                                                                                                                                           
## [147] "<a href=\"Papers/JBF_final_3.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [148] "<br>[<I>Accounting models explain spreads as well as market-based ones, but a hybrid mix does best.</I>]"                                                                                                                                                                                                        
## [149] "</LI>"                                                                                                                                                                                                                                                                                                           
## [150] ""                                                                                                                                                                                                                                                                                                                
## [151] ""                                                                                                                                                                                                                                                                                                                
## [152] "<LI><img src=\"graphics/jfint_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                          
## [153] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"                                                                                                                                                                                                                                      
## [154] "<I>Journal of Financial Intermediation</I>, v18(1), 112-123"                                                                                                                                                                                                                                                     
## [155] "<a href=\"Papers/cdsliq.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [156] "<br>[<I>Hedging in CDS markets provides a mechanism by which equity market liquidity impacts CDS spreads </I>]"                                                                                                                                                                                                  
## [157] "</LI>"                                                                                                                                                                                                                                                                                                           
## [158] ""                                                                                                                                                                                                                                                                                                                
## [159] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [160] "\"An Integrated Model for Hybrid Securities,\""                                                                                                                                                                                                                                                                  
## [161] "(with Raghu Sundaram), (2007), <I>Management Science</I>, v53, 1439-1451."                                                                                                                                                                                                                                       
## [162] "<a href=\"Papers/rsx_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                      
## [163] "<br>[<I>A general flexible model for pricing derivative securities that depend on equity, "                                                                                                                                                                                                                      
## [164] "interest rate and credit risk, using observables. Delivers dynamic implied default probabilities.</I>]"                                                                                                                                                                                                          
## [165] "</LI>"                                                                                                                                                                                                                                                                                                           
## [166] ""                                                                                                                                                                                                                                                                                                                
## [167] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [168] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""                                                                                                                                                                                                                                          
## [169] "(with Mike Chen), (2007), <I>Management Science</I>, v53, 1375-1388."                                                                                                                                                                                                                                            
## [170] "<a href=\"Papers/chat_FINAL.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [171] "<br>[<I>A methodology for parsing internet stock chat to develop a sentiment index. Assesses"                                                                                                                                                                                                                    
## [172] "whether small traders opinions contain information not in prices. </I>]"                                                                                                                                                                                                                                         
## [173] "</LI>"                                                                                                                                                                                                                                                                                                           
## [174] ""                                                                                                                                                                                                                                                                                                                
## [175] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"                                                                                                                                                                                                                                             
## [176] "\"Common Failings: How Corporate Defaults are Correlated\" "                                                                                                                                                                                                                                                     
## [177] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                                                                                                                                                                                                        
## [178] "(2007) <I>Journal of Finance</I>, v62, 93-117. "                                                                                                                                                                                                                                                                 
## [179] "<a href=\"Papers/ddks.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [180] "<br>[<I>New approach to test for defaul contagion using a stochastic time change. "                                                                                                                                                                                                                              
## [181] "Doubly stochastic models are refuted by the data.</I>]"                                                                                                                                                                                                                                                          
## [182] "</LI>"                                                                                                                                                                                                                                                                                                           
## [183] ""                                                                                                                                                                                                                                                                                                                
## [184] "<LI><img src=\"graphics/fmalogo_main.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                          
## [185] "\"A Clinical Study of Investor Discussion and Sentiment,\" "                                                                                                                                                                                                                                                     
## [186] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                                                                                                                                                                                                             
## [187] "<I>Financial Management</I>, v34(5), 103-137."                                                                                                                                                                                                                                                                   
## [188] "<a href=\"Papers/einfo.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                          
## [189] "<br>[<I>Examines the interaction of chat room information and news. </I>]"                                                                                                                                                                                                                                       
## [190] "</LI>"                                                                                                                                                                                                                                                                                                           
## [191] ""                                                                                                                                                                                                                                                                                                                
## [192] "<LI><img src=\"graphics/JF_cover.jpg\" width=\"120\" height=\"55\">"                                                                                                                                                                                                                                             
## [193] "\"International Portfolio Choice with Systemic Risk,\""                                                                                                                                                                                                                                                          
## [194] "(with Raman Uppal), 2004, <I>Journal of Finance</I>, v59(6), 2809-2834."                                                                                                                                                                                                                                         
## [195] "<a href=\"Papers/systemic.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [196] "<br>[<I>A model for portfolio optimization with systemic risk. "                                                                                                                                                                                                                                                 
## [197] "The loss resulting from diminished diversification is small, while"                                                                                                                                                                                                                                              
## [198] "that from holding very highly levered positions is large. </I>]"                                                                                                                                                                                                                                                 
## [199] "</LI>"                                                                                                                                                                                                                                                                                                           
## [200] ""                                                                                                                                                                                                                                                                                                                
## [201] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\"> \"Fee"                                                                                                                                                                                                                                       
## [202] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                                                                                                                                                                                                             
## [203] "Investor Welfare,'' (with Rangarajan Sundaram), 2002, <i>Review of"                                                                                                                                                                                                                                              
## [204] "Financial Studies</i>, v15, 1465-1497."                                                                                                                                                                                                                                                                          
## [205] "<a href=\"Papers/fees.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [206] "<br><I>[Compares fulcrum vs incentive fees structures from the standpoint of "                                                                                                                                                                                                                                   
## [207] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                                                                                                                                                                                                        
## [208] "are often optimal.] </I>"                                                                                                                                                                                                                                                                                        
## [209] "</LI>"                                                                                                                                                                                                                                                                                                           
## [210] ""                                                                                                                                                                                                                                                                                                                
## [211] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"                                                                                                                                                                                                                                            
## [212] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                                                                                                                                                                                                        
## [213] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"                                                                                                                                                                                                                                        
## [214] "2002, <I>Financial Analysts Journal</I>, May-June, 28-44."                                                                                                                                                                                                                                                       
## [215] "<a href=\"Papers/dsmarkov.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [216] "<br><I>[A HJM type two-factor model in risk free rates and spreads that also accounts "                                                                                                                                                                                                                          
## [217] "for rating transitions, allowing seamless pricing of many credit derivatives. ] </I>"                                                                                                                                                                                                                            
## [218] "</LI>"                                                                                                                                                                                                                                                                                                           
## [219] ""                                                                                                                                                                                                                                                                                                                
## [220] "<LI><img src=\"graphics/JOE_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [221] "\"The Surprise Element: Jumps in Interest Rates\", 2002, <I>Journal of"                                                                                                                                                                                                                                          
## [222] "Econometrics</I>, v106, 27-65."                                                                                                                                                                                                                                                                                  
## [223] "<a href=\"Papers/jump.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [224] "<br><I>[Estimation methodology for interest rates with jumps. A flexible "                                                                                                                                                                                                                                       
## [225] "specification that accommodates Federal Reserve Activity.]</I>"                                                                                                                                                                                                                                                  
## [226] "</LI>"                                                                                                                                                                                                                                                                                                           
## [227] ""                                                                                                                                                                                                                                                                                                                
## [228] "<LI><img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [229] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                                                                                                                                                                                                 
## [230] "  2002, <I>Review of Financial Studies</I>, v15(1), 195-241."                                                                                                                                                                                                                                                    
## [231] "<a href=\"Papers/affine.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [232] "<br><I>[General affine option pricing for interest rate derivatives covering a "                                                                                                                                                                                                                                 
## [233] "wide range of securities, allowing for M factors with N diffusions and L jumps.] </I>"                                                                                                                                                                                                                           
## [234] "</LI>"                                                                                                                                                                                                                                                                                                           
## [235] ""                                                                                                                                                                                                                                                                                                                
## [236] "<LI><img src=\"graphics/MS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                              
## [237] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                                                                                                                                                                                                  
## [238] "(with Rangarajan Sundaram), 2000, <I>Management Science</I>, v46(1), 46-62."                                                                                                                                                                                                                                     
## [239] "<a href=\"msfinal.ps\">[PS]</a>"                                                                                                                                                                                                                                                                                 
## [240] "<br><I>[HJM style two factor model for credit risk. ] </I>"                                                                                                                                                                                                                                                      
## [241] "</LI>"                                                                                                                                                                                                                                                                                                           
## [242] ""                                                                                                                                                                                                                                                                                                                
## [243] "<LI><img src=\"graphics/FAJ_cover.gif\" width=\"140\" height=\"55\">"                                                                                                                                                                                                                                            
## [244] "\"The Psychology of Financial Decision Making: A Case"                                                                                                                                                                                                                                                           
## [245] "for Theory-Driven Experimental Enquiry,''"                                                                                                                                                                                                                                                                       
## [246] "1999, (with Priya Raghubir),"                                                                                                                                                                                                                                                                                    
## [247] "<I>Financial Analyst's Journal</I>, Nov-Dec 1999, v55(6), 56-79."                                                                                                                                                                                                                                                
## [248] "<br><I>[Surveys the anomalies literature in Finance and shows how experimental"                                                                                                                                                                                                                                  
## [249] "studies may be used to disentangle competing hypotheses for the same anomaly.]</I>"                                                                                                                                                                                                                              
## [250] "</LI>"                                                                                                                                                                                                                                                                                                           
## [251] ""                                                                                                                                                                                                                                                                                                                
## [252] "<LI><img src=\"graphics/JFQA_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [253] "\"Of Smiles and Smirks: A Term Structure Perspective,''"                                                                                                                                                                                                                                                         
## [254] "1999, (with Rangarajan Sundaram), <I>Journal of"                                                                                                                                                                                                                                                                 
## [255] "Financial and Quantitative Analysis</I>, v34(2), 211-240."                                                                                                                                                                                                                                                       
## [256] "<a href=\"Papers/skew.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [257] "<br><I>[Explains how the shape of the volatility smile is determined by "                                                                                                                                                                                                                                        
## [258] "jumps and stochastic volatility. ]</I>"                                                                                                                                                                                                                                                                          
## [259] "</LI>"                                                                                                                                                                                                                                                                                                           
## [260] ""                                                                                                                                                                                                                                                                                                                
## [261] "<LI><img src=\"graphics/JBF_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [262] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"                                                                                                                                                                                                                                                   
## [263] "<I>Journal of Banking and Finance</I>, v23(6), 863-895."                                                                                                                                                                                                                                                         
## [264] "<br><I>[A theory to analyze the specialization of banking activities based "                                                                                                                                                                                                                                     
## [265] "by function based upon two dimensions: the degree of information asymmetry "                                                                                                                                                                                                                                     
## [266] "and the degree of verifiability of the value of the service rendered. ]</I>"                                                                                                                                                                                                                                     
## [267] "</LI>"                                                                                                                                                                                                                                                                                                           
## [268] ""                                                                                                                                                                                                                                                                                                                
## [269] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [270] "\"A Theory of Optimal Timing and Selectivity,'' "                                                                                                                                                                                                                                                                
## [271] "(with George Chacko), 1999, <I>Journal of"                                                                                                                                                                                                                                                                       
## [272] "Economic Dynamics and Control</I>, v23(7), 929-966."                                                                                                                                                                                                                                                             
## [273] "<br><I>[Dynamic optimal portfolio choice model for determining optimal effort"                                                                                                                                                                                                                                   
## [274] "allocation to timing and stock selection in asset allocation.]</I>"                                                                                                                                                                                                                                              
## [275] "</LI>"                                                                                                                                                                                                                                                                                                           
## [276] ""                                                                                                                                                                                                                                                                                                                
## [277] "<LI><img src=\"graphics/JEDC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [278] "\"A Direct Discrete-Time Approach to"                                                                                                                                                                                                                                                                            
## [279] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                                                                                                                                                                                                
## [280] "Model,\" 1999, <I>Journal of Economic Dynamics and Control</I>, v23(3), 333-369."                                                                                                                                                                                                                                
## [281] "<br><I>[HJM tree with jumps. Fast, fully recombining dynamics. ] </I>"                                                                                                                                                                                                                                           
## [282] "</LI>"                                                                                                                                                                                                                                                                                                           
## [283] ""                                                                                                                                                                                                                                                                                                                
## [284] "<LI><img src=\"graphics/RESTAT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                          
## [285] "\"The Central Tendency: A Second Factor in"                                                                                                                                                                                                                                                                      
## [286] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                                                                                                                                                                                                           
## [287] "<I>The Review of Economics and Statistics</I>, v80(1), 60-72."                                                                                                                                                                                                                                                   
## [288] "<br><I>[Model of the term structure with stochastic long-run mean. Related to "                                                                                                                                                                                                                                  
## [289] "Federal Reserve acitivity.]</I>"                                                                                                                                                                                                                                                                                 
## [290] "<a href=\"Papers/BalduzziDasForesi_ReStat1998_CentralTendency.pdf\">[PDF]</a>"                                                                                                                                                                                                                                   
## [291] "</LI>"                                                                                                                                                                                                                                                                                                           
## [292] ""                                                                                                                                                                                                                                                                                                                
## [293] "<LI> <img src=\"graphics/RFS_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [294] "\"Efficiency with Costly Information: A Reinterpretation of"                                                                                                                                                                                                                                                     
## [295] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "                                                                                                                                                                                                                                  
## [296] "Hlavka), <I>Review of Financial Studies</I>, vol. 6(1), 1993, pp 1-22. "                                                                                                                                                                                                                                         
## [297] "<a href=\"Papers/EGDH.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                           
## [298] "<br><I>[Mutual funds are not informationally efficient. "                                                                                                                                                                                                                                                        
## [299] "You are better off buying the index.]  </I>"                                                                                                                                                                                                                                                                     
## [300] "<br>"                                                                                                                                                                                                                                                                                                            
## [301] "Presented and Reprinted in the Proceedings of The "                                                                                                                                                                                                                                                              
## [302] "Seminar on the Analysis of Security Prices at the Center "                                                                                                                                                                                                                                                       
## [303] "for Research in Security   Prices  at the University of "                                                                                                                                                                                                                                                        
## [304] "Chicago, Graduate School of Business. </LI>"                                                                                                                                                                                                                                                                     
## [305] ""                                                                                                                                                                                                                                                                                                                
## [306] ""                                                                                                                                                                                                                                                                                                                
## [307] ""                                                                                                                                                                                                                                                                                                                
## [308] ""                                                                                                                                                                                                                                                                                                                
## [309] ""                                                                                                                                                                                                                                                                                                                
## [310] "<H2>MORE REFEREED JOURNAL PUBLICATIONS</H2>"                                                                                                                                                                                                                                                                     
## [311] ""                                                                                                                                                                                                                                                                                                                
## [312] ""                                                                                                                                                                                                                                                                                                                
## [313] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                            
## [314] "\"Managing Rollover Risk with Capital Structure Covenants"                                                                                                                                                                                                                                                       
## [315] "in Structured Finance Vehicles\" (2016),"                                                                                                                                                                                                                                                                        
## [316] "(with Seoyoung Kim), forthcoming <I>Journal of Fixed Income</I>."                                                                                                                                                                                                                                                
## [317] "<a href=\"Papers/siv_JFI.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                        
## [318] "<br><I>[We propose a covenant-based capital structure that mitigates rollover problems in SIVs and is Pareto-improving for equity and debt holders in the SPV.]</I>"                                                                                                                                             
## [319] "</LI>"                                                                                                                                                                                                                                                                                                           
## [320] ""                                                                                                                                                                                                                                                                                                                
## [321] ""                                                                                                                                                                                                                                                                                                                
## [322] ""                                                                                                                                                                                                                                                                                                                
## [323] "<LI><img src=\"graphics/JRFM.png\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                                 
## [324] "\"The Design and Risk Management of Structured Finance Vehicles\" (2016),"                                                                                                                                                                                                                                       
## [325] "(with Seoyoung Kim), forthcoming, <I>Journal of Risk and Financial Management</I>, Special Issue on Credit Risk."                                                                                                                                                                                                
## [326] "<a href=\"Papers/siv_JRFM.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [327] "<br><I>[Risk management for special investment vehicles is difficult, but necessary. "                                                                                                                                                                                                                           
## [328] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "                                                                                                                                                                                                                                
## [329] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "                                                                                                                                                                                                                    
## [330] "deleveraging policies in the form of leverage risk controls and contingent capital.]</I>"                                                                                                                                                                                                                        
## [331] "</LI>"                                                                                                                                                                                                                                                                                                           
## [332] ""                                                                                                                                                                                                                                                                                                                
## [333] ""                                                                                                                                                                                                                                                                                                                
## [334] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [335] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "                                                                                                                                                                                                                            
## [336] "(with Seoyoung Kim and Meir Statman),  "                                                                                                                                                                                                                                                                         
## [337] "<I>Journal of Portfolio Management</I>, 41(1), 95-108."                                                                                                                                                                                                                                                          
## [338] "<a href=\"Papers/underfunded.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [339] "<br><I>[Provides a new definition of underfunded portfolios, and compares four remedies for underfunding.]</I>"                                                                                                                                                                                                  
## [340] "</LI>"                                                                                                                                                                                                                                                                                                           
## [341] ""                                                                                                                                                                                                                                                                                                                
## [342] ""                                                                                                                                                                                                                                                                                                                
## [343] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                            
## [344] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"                                                                                                                                                                                                                                           
## [345] "(with Seoyoung Kim), <I>Journal of Fixed Income</I>, 24(3), 5-27."                                                                                                                                                                                                                                               
## [346] "<a href=\"Papers/ddo.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
## [347] "<br><I>[Optimizing portfolios where the return distributions of the assets is endogenous. The gains from restructuring distressed debt portfolios are large.]</I>"                                                                                                                                               
## [348] "</LI>"                                                                                                                                                                                                                                                                                                           
## [349] ""                                                                                                                                                                                                                                                                                                                
## [350] ""                                                                                                                                                                                                                                                                                                                
## [351] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [352] "\"Digital Portfolios.\" (2013), "                                                                                                                                                                                                                                                                                
## [353] "<I>Journal of Portfolio Management</I>, v39(2), 41-48."                                                                                                                                                                                                                                                          
## [354] "<a href=\"Papers/vport.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                          
## [355] "<br><I>[Constructing portfolios of assets with a binary payoff, large versus zero, and the differences in this optimization versus standard mean-variance portfolio construction.]</I>"                                                                                                                          
## [356] "</LI>"                                                                                                                                                                                                                                                                                                           
## [357] ""                                                                                                                                                                                                                                                                                                                
## [358] ""                                                                                                                                                                                                                                                                                                                
## [359] "<LI><img src=\"graphics/frl.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [360] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"                                                                                                                                                                                                                                                    
## [361] "(with Rishabh Bhandari), <I>Finance Research Letters</I>, v6, 122-129. "                                                                                                                                                                                                                                         
## [362] "<a href=\"Papers/tensor.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [363] "<br><I>[How to model fat-tailed portfolio distributions for "                                                                                                                                                                                                                                                    
## [364] "options on a multivariate system of assets, calibrated to the return "                                                                                                                                                                                                                                           
## [365] "means, covariance matrix, coskewness and cokurtosis tensors.]</I>"                                                                                                                                                                                                                                               
## [366] "</LI>"                                                                                                                                                                                                                                                                                                           
## [367] ""                                                                                                                                                                                                                                                                                                                
## [368] ""                                                                                                                                                                                                                                                                                                                
## [369] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [370] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"                                                                                                                                                                                                                                             
## [371] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(2), 73-85."                                                                                                                                                                                                                                     
## [372] "<a href=\"Papers/faclat.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [373] "<br><I>[Multifactor representations of securities on high-dimensional trees. Allows "                                                                                                                                                                                                                            
## [374] "you to price options on multiple assets in a unified framework. Computational"
## [375] "results assess using multithreading.]</I>"                                                                                                                                                                                                                                                                       
## [376] "</LI>"                                                                                                                                                                                                                                                                                                           
## [377] ""                                                                                                                                                                                                                                                                                                                
## [378] ""                                                                                                                                                                                                                                                                                                                
## [379] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\"> "                                                                                                                                                                                                                                            
## [380] "\"Modeling"                                                                                                                                                                                                                                                                                                      
## [381] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"                                                                                                                                                                                                                                             
## [382] "Santhosh Bandreddi and Rong Fan), <I>Journal of Fixed"                                                                                                                                                                                                                                                           
## [383] "Income</I>. Winter, 1-20."                                                                                                                                                                                                                                                                                       
## [384] "<a href=\"Papers/bscorrdef.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                      
## [385] "<br><I>[Extends the Das-Sundaram hybrid securities model to correlated default modeling.  ]</I>"                                                                                                                                                                                                                 
## [386] "</LI>"                                                                                                                                                                                                                                                                                                           
## [387] ""                                                                                                                                                                                                                                                                                                                
## [388] "<LI><img src=\"graphics/jfsr_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [389] "\"Basel II: Correlation Related Issues\" (2007), "                                                                                                                                                                                                                                                               
## [390] "<I>Journal of Financial Services Research</I>, v32, 17-38."                                                                                                                                                                                                                                                      
## [391] "<a href=\"Papers/Das_JFSR2007_Basel2.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                            
## [392] "<br><I>[Analysis of correlation related issues arising in the implementation"                                                                                                                                                                                                                                    
## [393] "of the Basel II accord.]</I>"                                                                                                                                                                                                                                                                                    
## [394] "</LI>"                                                                                                                                                                                                                                                                                                           
## [395] ""                                                                                                                                                                                                                                                                                                                
## [396] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [397] "\"Correlated Default Risk,\" (2006),"                                                                                                                                                                                                                                                                            
## [398] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                                                                                                                                                                                                           
## [399] "<I>Journal of Fixed Income</I>, Fall 2006, 7-32."                                                                                                                                                                                                                                                                
## [400] "<a href=\"Papers/DasFreedGengKapadia_JFI2006.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                    
## [401] "<br><I>[Empirical evidence on the nature of credit correlations. Correlations"                                                                                                                                                                                                                                   
## [402] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                                                                                                                                                                                                               
## [403] "correlations.]</I>"                                                                                                                                                                                                                                                                                              
## [404] "</LI>"                                                                                                                                                                                                                                                                                                           
## [405] ""                                                                                                                                                                                                                                                                                                                
## [406] "<LI><img src=\"graphics/qfcover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                               
## [407] "\"A Simple Model for Pricing Equity Options with Markov"                                                                                                                                                                                                                                                         
## [408] "Switching State Variables\" (2006),"                                                                                                                                                                                                                                                                             
## [409] "(with Donald Aingworth and Rajeev Motwani),"                                                                                                                                                                                                                                                                     
## [410] "<I>Quantitative Finance</I>, v6(2), 95-105."                                                                                                                                                                                                                                                                     
## [411] "<a href=\"Papers/switch.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [412] "<br><I>[A tree model for options when the underlying has regime switches.]</I>"                                                                                                                                                                                                                                  
## [413] "</LI>"                                                                                                                                                                                                                                                                                                           
## [414] ""                                                                                                                                                                                                                                                                                                                
## [415] "<LI><img src=\"graphics/mktletters.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [416] "\"The Firm's Management of Social Interactions,\" (2005)"                                                                                                                                                                                                                                                        
## [417] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                                                                                                                                                                                                    
## [418] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "                                                                                                                                                                                                                                                       
## [419] "<I>Marketing Letters</I>, v16, 415-428."
## [420] "<br><I>[A framework for how word-of-mouth communication is modeled in "                                                                                                                                                                                                                                          
## [421] "the practice of marketing.   ]</I>"                                                                                                                                                                                                                                                                              
## [422] "</LI>"                                                                                                                                                                                                                                                                                                           
## [423] ""                                                                                                                                                                                                                                                                                                                
## [424] "<LI><img src=\"graphics/jpm_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [425] "\"Financial Communities\" (with Jacob Sisk), 2005, "                                                                                                                                                                                                                                                             
## [426] "<i>Journal of Portfolio Management</i>, v31(4), "                                                                                                                                                                                                                                                                
## [427] "Summer, 112-123."                                                                                                                                                                                                                                                                                                
## [428] "<a href=\"Papers/fincom.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [429] "<br><I>[Applying graph theory to understanding investor networks to "                                                                                                                                                                                                                                            
## [430] "develop trading rules. ]</I>"                                                                                                                                                                                                                                                                                    
## [431] "</LI>"                                                                                                                                                                                                                                                                                                           
## [432] ""                                                                                                                                                                                                                                                                                                                
## [433] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [434] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                                                                                                                                                                                                       
## [435] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "                                                                                                                                                                                                                                                          
## [436] "<I>Journal of Investment Management</I>, v3(1), 29-44. "                                                                                                                                                                                                                                                         
## [437] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=125&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEQF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEQ\">[PDF]</a>"                                                                                                                                                  
## [438] "<br><I>[Randomized algorithm using MCMC on very large option pricing trees"                                                                                                                                                                                                                                      
## [439] "where incomplete information about the value of an asset may be exploited to "                                                                                                                                                                                                                                   
## [440] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                                                                                                                                                                                                  
## [441] "approximation scheme (FPRAS) is available.]</I>"                                                                                                                                                                                                                                                                 
## [442] "</LI>"                                                                                                                                                                                                                                                                                                           
## [443] ""                                                                                                                                                                                                                                                                                                                
## [444] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [445] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""                                                                                                                                                                                                                                            
## [446] "(with Gary Geng), 2004, <I>Journal of Investment Management</I>, v2(2), 44-70,"                                                                                                                                                                                                                                  
## [447] "Special Issue on Default Risk. "                                                                                                                                                                                                                                                                                 
## [448] "<a href=\"https://www.joim.com/ArticleContainer.asp?artid=70&print=false&Key=GQ6!WiJQSJrlrcVJSoeGhEJF7LVNhzfb0M!Nz!0SO5foSMK6!WiHQSJrlrcVJSoeGhEJ\">[PDF]</a>"                                                                                                                                                   
## [449] "<br><I>[Which copula and marginal distributions best describe default probability"                                                                                                                                                                                                                               
## [450] "correlations? Develops models and methodology to answer this question. ]</I>"                                                                                                                                                                                                                                    
## [451] "</LI>"                                                                                                                                                                                                                                                                                                           
## [452] ""                                                                                                                                                                                                                                                                                                                
## [453] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [454] "\"Private Equity Returns: An Empirical Examination of the Exit of"                                                                                                                                                                                                                                               
## [455] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"                                                                                                                                                                                                                                         
## [456] "2003, <I>Journal of Investment Management</I>, v1(1), 152-177."                                                                                                                                                                                                                                                  
## [457] "<a href=\"Papers/PE_returns.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [458] "<br><I>[Gains from venture-backed investments depend upon the industry, the stage of the"                                                                                                                                                                                                                        
## [459] "firm being financed, the valuation at the time of financing, and the prevailing market"                                                                                                                                                                                                                          
## [460] "sentiment. Helps understand the risk premium required for the"                                                                                                                                                                                                                                                   
## [461] "valuation of private equity investments  ]</I>"                                                                                                                                                                                                                                                                  
## [462] "</LI>"                                                                                                                                                                                                                                                                                                           
## [463] ""                                                                                                                                                                                                                                                                                                                
## [464] "<LI><img src=\"graphics/IJISAFM_cover.gif\" width=\"40\" height=\"55\"> \"A"                                                                                                                                                                                                                                     
## [465] "Numerical Algorithm for Consumption/Investment Problems,\" (with Rangarajan"                                                                                                                                                                                                                                     
## [466] "Sundaram), 2002, <I>International Journal of Intelligent"                                                                                                                                                                                                                                                        
## [467] "Systems in Accounting, Finance and Management</I>, (Special"                                                                                                                                                                                                                                                     
## [468] "Issue on Computational Methods in Economics and Finance),  "                                                                                                                                                                                                                                                     
## [469] "December, 55-69."                                                                                                                                                                                                                                                                                                
## [470] "<a href=\"Papers/hjb.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                            
## [471] "<br><I>[A simple regression approach to solving optimal consumption"                                                                                                                                                                                                                                             
## [472] "and portfolio problems wit diffusions and jumps.]</I>"                                                                                                                                                                                                                                                           
## [473] "</LI>"                                                                                                                                                                                                                                                                                                           
## [474] ""                                                                                                                                                                                                                                                                                                                
## [475] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [476] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                                                                                                                                                                                                
## [477] "Default,\" (with Rong Fan and Gary Geng), 2002, <I>Journal of"                                                                                                                                                                                                                                                   
## [478] "Fixed Income</I>, December, v12(3), 17-23.  "                                                                                                                                                                                                                                                                    
## [479] "<a href=\"Papers/ratingmigr.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [480] "<br><I>[Bayesian model for predicting rating changes based on the"                                                                                                                                                                                                                                               
## [481] "dynamics of default probabilities.]</I>"                                                                                                                                                                                                                                                                         
## [482] "</LI>"                                                                                                                                                                                                                                                                                                           
## [483] ""                                                                                                                                                                                                                                                                                                                
## [484] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [485] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""                                                                                                                                                                                                                                                 
## [486] "(with Gifford Fong, and Gary Geng),"                                                                                                                                                                                                                                                                             
## [487] "2001, <i>Journal of Fixed Income</i>, v11(3), 9-19."                                                                                                                                                                                                                                                             
## [488] "<br><I>[The connection between credit portfolio loss distributions"                                                                                                                                                                                                                                              
## [489] "and credit correlations. ]</I>"                                                                                                                                                                                                                                                                                  
## [490] "</LI>"                                                                                                                                                                                                                                                                                                           
## [491] ""                                                                                                                                                                                                                                                                                                                
## [492] "<LI><img src=\"graphics/CIR_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [493] "\"How Diversified are Internationally Diversified Portfolios:"                                                                                                                                                                                                                                                   
## [494] "Time-Variation in the Covariances between International Returns,\""                                                                                                                                                                                                                                              
## [495] "1998, (with Raman Uppal), <I>Canadian Investment Review</I>, Spring, 7-11."                                                                                                                                                                                                                                      
## [496] "<a href=\"Papers/DasUppalCIR1998.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                
## [497] "<br><I>[Internation portfolio risk has systemic components.   ]</I>"                                                                                                                                                                                                                                             
## [498] "</LI>     "                                                                                                                                                                                                                                                                                                      
## [499] ""                                                                                                                                                                                                                                                                                                                
## [500] "<LI><img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [501] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                                                                                                                                                                                                      
## [502] "Processes,\" 1997, <I>Review of Derivatives Research</I>, v1(3), 211-244. "                                                                                                                                                                                                                                      
## [503] "<br><I>[Extends the finite-differencing approach for interest rate derivatives"                                                                                                                                                                                                                                  
## [504] "to jump processes.]</I>"                                                                                                                                                                                                                                                                                         
## [505] "</LI>"                                                                                                                                                                                                                                                                                                           
## [506] ""                                                                                                                                                                                                                                                                                                                
## [507] "<LI><img src=\"graphics/AEL_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [508] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""                                                                                                                                                                                                                                           
## [509] "1997, <I>Applied Economics Letters</I>, December, v4, 719-723."                                                                                                                                                                                                                                                  
## [510] "<br><I>[Connects option pricing theory to labor search theory. Calibrates to "                                                                                                                                                                                                                                   
## [511] "labor market data.]</I>"                                                                                                                                                                                                                                                                                         
## [512] "</LI>"                                                                                                                                                                                                                                                                                                           
## [513] ""                                                                                                                                                                                                                                                                                                                
## [514] "<LI> <img src=\"graphics/FMII_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [515] "\"Auction Theory: A Summary with Applications and Evidence"                                                                                                                                                                                                                                                      
## [516] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"                                                                                                                                                                                                                                                  
## [517] "<I>Financial Markets, Institutions and Instruments</I>, v5(5), 1-36."                                                                                                                                                                                                                                            
## [518] "<a href=\"Papers/DasSundaram_FMII1996_AuctionTheory.pdf\">[PDF]</a>"                                                                                                                                                                                                                                             
## [519] "<br><I>[A survey of models and literature on Treasury Auctions. ]</I>"                                                                                                                                                                                                                                           
## [520] "</LI>"                                                                                                                                                                                                                                                                                                           
## [521] ""                                                                                                                                                                                                                                                                                                                
## [522] "<LI><img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [523] "\"A Simple Approach to Three Factor Affine Models of the"                                                                                                                                                                                                                                                        
## [524] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                                                                                                                                                                                                      
## [525] "Sundaram), 1996, <I>Journal of Fixed Income</I>, v6(3), 43-53."                                                                                                                                                                                                                                                  
## [526] "<br><I>[ An easy way to calibrate three factor models using method of moments.   ]</I>"                                                                                                                                                                                                                          
## [527] "</LI>"                                                                                                                                                                                                                                                                                                           
## [528] ""                                                                                                                                                                                                                                                                                                                
## [529] "<LI> <img src=\"graphics/JFI_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [530] "\"Analytical Approximations of  the Term Structure"                                                                                                                                                                                                                                                              
## [531] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "                                                                                                                                                                                                                                                    
## [532] "(with Jamil Baz), <I>Journal of Fixed Income</I>, v6(1), 78-86. "                                                                                                                                                                                                                                                
## [533] "<br><I>[An exact solution to an approximate PDE may be better than "                                                                                                                                                                                                                                             
## [534] "an approximate solution to an exact PDDE for term structure models. ]</I>"                                                                                                                                                                                                                                       
## [535] "</LI>"                                                                                                                                                                                                                                                                                                           
## [536] ""                                                                                                                                                                                                                                                                                                                
## [537] "<LI> <img src=\"graphics/JAF_cover.jpg\" width=\"40\" height=\"55\"> \"Revisiting"                                                                                                                                                                                                                               
## [538] "Markov Chain Term Structure Models: Extensions and Applications,\""                                                                                                                                                                                                                                              
## [539] "1996, <I>Financial Practice and Education</I>, v6(1), 33-45. "                                                                                                                                                                                                                                                   
## [540] "<br><I>[A new pedagogy for Markov models of interest rates.  ]</I>"                                                                                                                                                                                                                                              
## [541] "</LI>"                                                                                                                                                                                                                                                                                                           
## [542] ""                                                                                                                                                                                                                                                                                                                
## [543] ""                                                                                                                                                                                                                                                                                                                
## [544] "<LI> <img src=\"graphics/REDR_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [545] "\"Exact Solutions for Bond and Options Prices"                                                                                                                                                                                                                                                                   
## [546] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"                                                                                                                                                                                                                                                      
## [547] "<I>Review of Derivatives Research</I>, v1(1), 7-24. "                                                                                                                                                                                                                                                            
## [548] "<a href=\"Papers/DasForesiREDR1996.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                              
## [549] "<br><I>[First paper to show that affine solutions exist for "                                                                                                                                                                                                                                                    
## [550] "jump-diffusion term structure models.]</I>"                                                                                                                                                                                                                                                                      
## [551] "</LI>"                                                                                                                                                                                                                                                                                                           
## [552] ""                                                                                                                                                                                                                                                                                                                
## [553] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [554] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                                                                                                                                                                                                             
## [555] "and Credit Spreads are Stochastic,\" 1996, "                                                                                                                                                                                                                                                                     
## [556] "(with Peter Tufano), <I>The Journal of Financial Engineering</I>,"                                                                                                                                                                                                                                               
## [557] "v5(2), 161-198."                                                                                                                                                                                                                                                                                                 
## [558] "<a href=\"Papers/DasTufanoJFE1996.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                               
## [559] "<br><I>[Rating based model for credit derivatives with correlation between recovery "                                                                                                                                                                                                                            
## [560] "rates, interest rates and default probabilities. ]</I>"                                                                                                                                                                                                                                                          
## [561] "</LI>"                                                                                                                                                                                                                                                                                                           
## [562] ""                                                                                                                                                                                                                                                                                                                
## [563] "<LI> <img src=\"graphics/JOD_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [564] "\"Credit Risk Derivatives,\" <I>Journal of Derivatives</I>, 1995, pg 7-21. "                                                                                                                                                                                                                                     
## [565] "<a href=\"Papers/Das-JOD1995.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [566] "<br><I>[Introduces early models for pricing credit derivatives as compound options.  ]</I>"                                                                                                                                                                                                                      
## [567] "</LI>"                                                                                                                                                                                                                                                                                                           
## [568] ""                                                                                                                                                                                                                                                                                                                
## [569] ""                                                                                                                                                                                                                                                                                                                
## [570] ""                                                                                                                                                                                                                                                                                                                
## [571] ""                                                                                                                                                                                                                                                                                                                
## [572] ""                                                                                                                                                                                                                                                                                                                
## [573] "<H2>SHORTER ARTICLES and BOOK CHAPTERS (Mostly Non-refereed, Invited)</H2>"                                                                                                                                                                                                                                      
## [574] ""                                                                                                                                                                                                                                                                                                                
## [575] "<LI><img src=\"graphics/fame.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                  
## [576] "\"Did CDS Trading Improve the Market for Corporate Bonds,\" (2016), "                                                                                                                                                                                                                                            
## [577] "(with Madhu Kalimipalli and Subhankar Nayak), "                                                                                                                                                                                                                                                                  
## [578] "<I>Finance and Accounting Memos</I> Issue 3, 45--49. "                                                                                                                                                                                                                                                           
## [579] "<a href=\"Papers/fame-3.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [580] "<br><I>[CDS trading adversely impacted the bond market.]</I>"                                                                                                                                                                                                                                                    
## [581] "</LI> "                                                                                                                                                                                                                                                                                                          
## [582] ""                                                                                                                                                                                                                                                                                                                
## [583] "<LI><img src=\"graphics/FD.png\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                    
## [584] "\"Big Data's Big Muscle,\" (2016), "                                                                                                                                                                                                                                                                             
## [585] "<I>Finance and Development (IMF)</I>, September, 14(2), 26-28."                                                                                                                                                                                                                                                  
## [586] "<a href=\"Papers/FD_BigData.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                     
## [587] "<br><I>[Economics in the machine age.]</I>"                                                                                                                                                                                                                                                                      
## [588] "</LI> "                                                                                                                                                                                                                                                                                                          
## [589] ""                                                                                                                                                                                                                                                                                                                
## [590] ""                                                                                                                                                                                                                                                                                                                
## [591] "<LI><img src=\"graphics/jwm.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [592] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "                                                                                                                                                                                      
## [593] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                                                                                                                                                                                                     
## [594] "<I>Journal of Wealth Management</I>, Fall, 14(2), 25-31."                                                                                                                                                                                                                                                        
## [595] "<br><I>[A framework for goal driven mental accounting and behavioral portfolio allocation that extends mean-variance portfolios.]</I>"                                                                                                                                                                           
## [596] "</LI> "                                                                                                                                                                                                                                                                                                          
## [597] ""                                                                                                                                                                                                                                                                                                                
## [598] "<LI><img src=\"graphics/HNAF_Wiley.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [599] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "                                                                                                                                                                            
## [600] "<a href=\"Papers/newsmetrics.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                    
## [601] "</LI>"                                                                                                                                                                                                                                                                                                           
## [602] ""                                                                                                                                                                                                                                                                                                                
## [603] ""                                                                                                                                                                                                                                                                                                                
## [604] ""                                                                                                                                                                                                                                                                                                                
## [605] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [606] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"                                                                                                                                                                                                                                             
## [607] "<I>Journal of Investment Management</I>, 9(2), 88-106."                                                                                                                                                                                                                                                          
## [608] "<a href=\"Papers/randlatt.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                       
## [609] "</LI>"                                                                                                                                                                                                                                                                                                           
## [610] ""                                                                                                                                                                                                                                                                                                                
## [611] ""                                                                                                                                                                                                                                                                                                                
## [612] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [613] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"                                                                                                                                                                                                                                         
## [614] "(with Brian Granger), <I>Journal of Investment Management</I>, 9(4), 72-84"                                                                                                                                                                                                                                      
## [615] "<a href=\"Papers/cython.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [616] "</LI>"                                                                                                                                                                                                                                                                                                           
## [617] ""                                                                                                                                                                                                                                                                                                                
## [618] ""                                                                                                                                                                                                                                                                                                                
## [619] ""                                                                                                                                                                                                                                                                                                                
## [620] "<LI><img src=\"graphics/IEEE_IS_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                         
## [621] "\"The Finance Web: Internet Information and Markets,\" (2010), "                                                                                                                                                                                                                                                 
## [622] "<I>IEEE Intelligent Systems</I>, 25(2), Mar/Apr, 74--78. "                                                                                                                                                                                                                                                       
## [623] "</LI>"                                                                                                                                                                                                                                                                                                           
## [624] ""                                                                                                                                                                                                                                                                                                                
## [625] ""                                                                                                                                                                                                                                                                                                                
## [626] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [627] "\"Financial Applications with Parallel R,\" (2009), "                                                                                                                                                                                                                                                            
## [628] "(with Brian Granger), <I>Journal of Investment Management</I>, 7(4), 66-77"                                                                                                                                                                                                                                      
## [629] "<a href=\"Papers/parallelr_options.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                              
## [630] "</LI>"                                                                                                                                                                                                                                                                                                           
## [631] ""                                                                                                                                                                                                                                                                                                                
## [632] ""                                                                                                                                                                                                                                                                                                                
## [633] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [634] "\"Recovery Swaps,\" (2009), (with Paul Hanouna),  "                                                                                                                                                                                                                                                              
## [635] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1507--1509 "                                                                                                                                                                                                                             
## [636] ""                                                                                                                                                                                                                                                                                                                
## [637] "<LI><img src=\"graphics/EQF.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                                   
## [638] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "                                                                                                                                                                                                                                                                
## [639] "<I>Encyclopedia of Quantitative Finance</I>, John Wiley and Sons, U.K., 1505--1507"                                                                                                                                                                                                                              
## [640] ""                                                                                                                                                                                                                                                                                                                
## [641] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [642] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                                                                                                                                                                                                 
## [643] "<I> Innovations in Investment Management</I>, Bloomberg Press, 85-112."                                                                                                                                                                                                                                          
## [644] ""                                                                                                                                                                                                                                                                                                                
## [645] ""                                                                                                                                                                                                                                                                                                                
## [646] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [647] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "                                                                                                                                                                                                                                                    
## [648] "<I>Journal of Investment Management</I>, v4(3), 93-105."                                                                                                                                                                                                                                                         
## [649] "</LI>"                                                                                                                                                                                                                                                                                                           
## [650] ""                                                                                                                                                                                                                                                                                                                
## [651] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [652] "\"Multiple-Core Processors for Finance Applications,\" 2006, "                                                                                                                                                                                                                                                   
## [653] "<I>Journal of Investment Management</I>, v4(2), 76-81."                                                                                                                                                                                                                                                          
## [654] "</LI>"                                                                                                                                                                                                                                                                                                           
## [655] ""                                                                                                                                                                                                                                                                                                                
## [656] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [657] "\"Power Laws,\" 2005, (with Jacob Sisk), "                                                                                                                                                                                                                                                                       
## [658] "<I>Journal of Investment Management</I>, v3(3), 84-91."                                                                                                                                                                                                                                                          
## [659] "<a href=\"https://www.joim.com/ArticleContainer.asp?artID=154\">[PDF]</a>"                                                                                                                                                                                                                                       
## [660] "</LI>"                                                                                                                                                                                                                                                                                                           
## [661] ""                                                                                                                                                                                                                                                                                                                
## [662] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [663] "\"Genetic Algorithms,\" 2005,"                                                                                                                                                                                                                                                                                   
## [664] "<I>Journal of Investment Management</I>, v3(2), 77-82."                                                                                                                                                                                                                                                          
## [665] "</LI>"                                                                                                                                                                                                                                                                                                           
## [666] ""                                                                                                                                                                                                                                                                                                                
## [667] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [668] "\"Recovery Risk,\" 2005,"                                                                                                                                                                                                                                                                                        
## [669] "<I>Journal of Investment Management</I>, v3(1), 113-120."                                                                                                                                                                                                                                                        
## [670] "</LI>"                                                                                                                                                                                                                                                                                                           
## [671] ""                                                                                                                                                                                                                                                                                                                
## [672] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [673] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"                                                                                                                                                                                                                                           
## [674] "<I>Journal of Investment Management</I>, v2(4), 132-143."                                                                                                                                                                                                                                                        
## [675] "</LI>"                                                                                                                                                                                                                                                                                                           
## [676] ""                                                                                                                                                                                                                                                                                                                
## [677] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [678] "\"Technical Analysis\", (with David Tien), 2004"                                                                                                                                                                                                                                                                 
## [679] "<I>Journal of Investment Management</I>, v2(1), 79-85."                                                                                                                                                                                                                                                          
## [680] "</LI>"                                                                                                                                                                                                                                                                                                           
## [681] ""                                                                                                                                                                                                                                                                                                                
## [682] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [683] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                                                                                                                                                                                                       
## [684] "Madhu Kalimipalli), 2003,"                                                                                                                                                                                                                                                                                       
## [685] "<I>Journal of Investment Management</I>, v1(4), 95-103."                                                                                                                                                                                                                                                         
## [686] "</LI>"                                                                                                                                                                                                                                                                                                           
## [687] ""                                                                                                                                                                                                                                                                                                                
## [688] "<LI><img src=\"graphics/JEL_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [689] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "                                                                                                                                                                                                                                                 
## [690] "2004, <I>Journal of Economic Literature</I>, vXLII, 528-529."                                                                                                                                                                                                                                                    
## [691] ""                                                                                                                                                                                                                                                                                                                
## [692] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [693] "\"Contagion\", 2003,"                                                                                                                                                                                                                                                                                            
## [694] "<I>Journal of Investment Management</I>, v1(3), 78-84."                                                                                                                                                                                                                                                          
## [695] "</LI>"                                                                                                                                                                                                                                                                                                           
## [696] ""                                                                                                                                                                                                                                                                                                                
## [697] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [698] "\"Hedge Funds\", 2003,"                                                                                                                                                                                                                                                                                          
## [699] "<I>Journal of Investment Management</I>, v1(2), 76-81."                                                                                                                                                                                                                                                          
## [700] "Reprinted in "                                                                                                                                                                                                                                                                                                   
## [701] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "                                                                                                                                                                                                                                                
## [702] "Characteristics and "                                                                                                                                                                                                                                                                                            
## [703] "Analysis, 2005, World Scientific."                                                                                                                                                                                                                                                                               
## [704] "</LI>"                                                                                                                                                                                                                                                                                                           
## [705] ""                                                                                                                                                                                                                                                                                                                
## [706] "<LI><img src=\"graphics/JOIM_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [707] "\"The Internet and Investors\", 2003,"                                                                                                                                                                                                                                                                           
## [708] "<I>Journal of Investment Management</I>, v1(1), 213-217."                                                                                                                                                                                                                                                        
## [709] "</LI>"                                                                                                                                                                                                                                                                                                           
## [710] ""                                                                                                                                                                                                                                                                                                                
## [711] "<LI><img src=\"graphics/EC_cover.gif\">"                                                                                                                                                                                                                                                                         
## [712] "  \"Useful things to know about Correlated Default Risk,\""                                                                                                                                                                                                                                                      
## [713] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                                                                                                                                                                                             
## [714] "2001,&nbsp; <i>Extra Credit</i>, November-December, 14-23."                                                                                                                                                                                                                                                      
## [715] "</LI>"                                                                                                                                                                                                                                                                                                           
## [716] ""                                                                                                                                                                                                                                                                                                                
## [717] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [718] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                                                                                                                                                                                                  
## [719] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                                                                                                                                                                                                       
## [720] "Courant Institute of Mathematical Sciences, special volume on"                                                                                                                                                                                                                                                   
## [721] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."                                                                                                                                                                                                                                            
## [722] "</LI>"                                                                                                                                                                                                                                                                                                           
## [723] ""                                                                                                                                                                                                                                                                                                                
## [724] "<LI><img src=\"graphics/QAFM_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                            
## [725] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                                                                                                                                                                                                  
## [726] "(with Rangarajan Sundaram), reprinted in "                                                                                                                                                                                                                                                                       
## [727] "the Courant Institute of Mathematical Sciences, special volume on"                                                                                                                                                                                                                                               
## [728] "<I>Quantitative Analysis in Financial Markets</I>, Volume III, 2001."                                                                                                                                                                                                                                            
## [729] "</LI>"                                                                                                                                                                                                                                                                                                           
## [730] ""                                                                                                                                                                                                                                                                                                                
## [731] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [732] "\"Stochastic Mean Models of the Term Structure,''"                                                                                                                                                                                                                                                               
## [733] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                                                                                                                                                                                                            
## [734] "2000, <I>Advanced Fixed-Income Valuation Tools"                                                                                                                                                                                                                                                                  
## [735] "</I>, edited by N. Jegadeesh and B. Tuckman,"                                                                                                                                                                                                                                                                    
## [736] "John Wiley & Sons, Inc., 128-161."                                                                                                                                                                                                                                                                               
## [737] "</LI>"                                                                                                                                                                                                                                                                                                           
## [738] ""                                                                                                                                                                                                                                                                                                                
## [739] "<LI><img src=\"graphics/AFIVT_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                           
## [740] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                                                                                                                                                                                                      
## [741] "2000, <I>Advanced Fixed-Income Valuation Tools"                                                                                                                                                                                                                                                                  
## [742] "</I>, edited by N. Jegadeesh and B. Tuckman,"                                                                                                                                                                                                                                                                    
## [743] "John Wiley & Sons, Inc., 162-189."                                                                                                                                                                                                                                                                               
## [744] "</LI>"                                                                                                                                                                                                                                                                                                           
## [745] ""                                                                                                                                                                                                                                                                                                                
## [746] "<LI><img src=\"graphics/FCR_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [747] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                                                                                                                                                                                                               
## [748] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                                                                                                                                                                                                        
## [749] "in <I>The Financing of Catastrophe Risk</I>, Kenneth A"                                                                                                                                                                                                                                                          
## [750] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                                                                                                                                                                                                        
## [751] "</LI>"                                                                                                                                                                                                                                                                                                           
## [752] ""                                                                                                                                                                                                                                                                                                                
## [753] "<LI><img src=\"graphics/HCD_cover.jpg\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [754] "  \"Pricing Credit Derivatives,'' "                                                                                                                                                                                                                                                                              
## [755] "1999, <I>Handbook of Credit Derivatives</I>, eds J. Francis,"                                                                                                                                                                                                                                                    
## [756] "J. Frost and J.G. Whittaker, 101-138."                                                                                                                                                                                                                                                                           
## [757] "</LI>"                                                                                                                                                                                                                                                                                                           
## [758] ""                                                                                                                                                                                                                                                                                                                
## [759] "<LI><img src=\"graphics/PEC_cover.gif\" width=\"40\" height=\"55\">"                                                                                                                                                                                                                                             
## [760] "\"On the Recursive Implementation of Term Structure Models,'' "                                                                                                                                                                                                                                                  
## [761] "1998, <I>Pecunia</I>, The Netherlands, Summer 1998, 45-49."                                                                                                                                                                                                                                                      
## [762] "</LI>"                                                                                                                                                                                                                                                                                                           
## [763] ""                                                                                                                                                                                                                                                                                                                
## [764] ""                                                                                                                                                                                                                                                                                                                
## [765] "</OL>"                                                                                                                                                                                                                                                                                                           
## [766] ""                                                                                                                                                                                                                                                                                                                
## [767] ""                                                                                                                                                                                                                                                                                                                
## [768] "<H2>WORKING PAPERS</H2>"                                                                                                                                                                                                                                                                                         
## [769] ""                                                                                                                                                                                                                                                                                                                
## [770] "<OL>"                                                                                                                                                                                                                                                                                                            
## [771] ""                                                                                                                                                                                                                                                                                                                
## [772] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [773] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "                                                                                                                                                                                                                                              
## [774] "(with Jeroen Jansen and Frank Fabozzi)."                                                                                                                                                                                                                                                                         
## [775] "<a href=\"Papers/LocalVolatility.pdf\">[PDF]</a>. "                                                                                                                                                                                                                                                              
## [776] ""                                                                                                                                                                                                                                                                                                                
## [777] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [778] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                                                                                                                                                                                                
## [779] "<a href=\"Papers/taxopt.pdf\">[PDF]</a>. "                                                                                                                                                                                                                                                                       
## [780] "<a href=\"Papers/taxopt_slides_RFinance_2015_05.pdf\">[SLIDES RFinance]</a>. "                                                                                                                                                                                                                                   
## [781] "<a href=\"Papers/taxopt_slides2.pdf\">[SLIDES JOIM]</a>. "                                                                                                                                                                                                                                                       
## [782] ""                                                                                                                                                                                                                                                                                                                
## [783] ""                                                                                                                                                                                                                                                                                                                
## [784] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [785] "\"The Fast and the Curious: VC Drift\" "                                                                                                                                                                                                                                                                         
## [786] "(with Amit Bubna and Paul Hanouna), "                                                                                                                                                                                                                                                                            
## [787] "<a href=\"Papers/vcstyle.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                        
## [788] ""                                                                                                                                                                                                                                                                                                                
## [789] ""                                                                                                                                                                                                                                                                                                                
## [790] "<LI><img src=\"graphics/frog2.gif\">"                                                                                                                                                                                                                                                                            
## [791] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "                                                                                                                                                                                                                                   
## [792] "<a href=\"Papers/vccomm.pdf\">[PDF]</a>"                                                                                                                                                                                                                                                                         
## [793] ""                                                                                                                                                                                                                                                                                                                
## [794] ""                                                                                                                                                                                                                                                                                                                
## [795] ""                                                                                                                                                                                                                                                                                                                
## [796] ""                                                                                                                                                                                                                                                                                                                
## [797] ""                                                                                                                                                                                                                                                                                                                
## [798] "</OL>"                                                                                                                                                                                                                                                                                                           
## [799] ""                                                                                                                                                                                                                                                                                                                
## [800] ""                                                                                                                                                                                                                                                                                                                
## [801] ""                                                                                                                                                                                                                                                                                                                
## [802] ""                                                                                                                                                                                                                                                                                                                
## [803] ""                                                                                                                                                                                                                                                                                                                
## [804] ""                                                                                                                                                                                                                                                                                                                
## [805] ""                                                                                                                                                                                                                                                                                                                
## [806] ""                                                                                                                                                                                                                                                                                                                
## [807] "</UL>"                                                                                                                                                                                                                                                                                                           
## [808] "<p>"                                                                                                                                                                                                                                                                                                             
## [809] "My page on SSRN (with downloadable papers) is <a"                                                                                                                                                                                                                                                                
## [810] "href=\"http://ssrn.com/author=17108\">here</a>."                                                                                                                                                                                                                                                                 
## [811] ""                                                                                                                                                                                                                                                                                                                
## [812] ""                                                                                                                                                                                                                                                                                                                
## [813] ""                                                                                                                                                                                                                                                                                                                
## [814] "                                                "                                                                                                                                                                                                                                                                
## [815] ""                                                                                                                                                                                                                                                                                                                
## [816] ""                                                                                                                                                                                                                                                                                                                
## [817] ""                                                                                                                                                                                                                                                                                                                
## [818] "</BODY>"                                                                                                                                                                                                                                                                                                         
## [819] ""                                                                                                                                                                                                                                                                                                                
## [820] "</HTML>"                                                                                                                                                                                                                                                                                                         
## [821] ""                                                                                                                                                                                                                                                                                                                
## [822] ""                                                                                                                                                                                                                                                                                                                
## [823] ""
# Drop every line that contains a markup character (HTML tags, link brackets,
# braces, underscores in file names, slashes in paths).
# fixed = TRUE matches each character literally rather than as a regex.
text = text[!grepl("<", text, fixed = TRUE)]
text = text[!grepl(">", text, fixed = TRUE)]
text = text[!grepl("]", text, fixed = TRUE)]
text = text[!grepl("}", text, fixed = TRUE)]
text = text[!grepl("_", text, fixed = TRUE)]
text = text[!grepl("/", text, fixed = TRUE)]
print(length(text))
## [1] 336
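The line filter above is easier to follow on a small example. The sketch below applies the same idea to a hypothetical five-line vector `toy` (a stand-in for the downloaded page, not the actual data), looping over the six characters instead of repeating the call. The last step, dropping blank lines, is an optional extra and an assumption on my part; it is not part of the code above.

```r
# Hypothetical stand-in for a few lines of the downloaded page
toy = c("<LI><img src=\"graphics/frog2.gif\">",
        "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), ",
        "<a href=\"Papers/vccomm.pdf\">[PDF]</a>",
        "",
        "John Wiley & Sons, Inc., 128-161.")

# Remove any line containing one of the markup characters.
# fixed = TRUE treats each pattern as a literal string, not a regex.
cleaned = toy
for (ch in c("<", ">", "]", "}", "_", "/")) {
  cleaned = cleaned[!grepl(ch, cleaned, fixed = TRUE)]
}
print(length(cleaned))

# Optional extra step (an assumption, not in the original code):
# drop lines that are empty or contain only whitespace.
nonblank = cleaned[nzchar(trimws(cleaned))]
print(nonblank)
```

Filtering whole lines is a blunt instrument: a title that happens to share a line with a link or tag is dropped along with the markup, which is why some entries in the printed result appear truncated.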
print(text)
##   [1] ""                                                                                                                                    
##   [2] ""                                                                                                                                    
##   [3] ""                                                                                                                                    
##   [4] "\"Data Science: Theories, Models, Algorithms, and Analytics\" (web book -- work in progress)"                                        
##   [5] ""                                                                                                                                    
##   [6] ""                                                                                                                                    
##   [7] "\"Derivatives: Principles and Practice\" (2010),"                                                                                    
##   [8] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                                  
##   [9] ""                                                                                                                                    
##  [10] ""                                                                                                                                    
##  [11] ""                                                                                                                                    
##  [12] ""                                                                                                                                    
##  [13] "\"An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                                 
##  [14] ""                                                                                                                                    
##  [15] "\"Matrix Metrics: Network-Based Systemic Risk Scoring\", (2016)."                                                                    
##  [16] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                               
##  [17] "the best paper on SIFIs (systemically important financial institutions). "                                                           
##  [18] "It also won the best paper award at "                                                                                                
##  [19] ""                                                                                                                                    
##  [20] ""                                                                                                                                    
##  [21] ""                                                                                                                                    
##  [22] ""                                                                                                                                    
##  [23] "\"Credit Spreads with Dynamic Debt\" (with Seoyoung Kim), (2015), "                                                                  
##  [24] ""                                                                                                                                    
##  [25] "\"Text and Context: Language Analytics for Finance\", (2014),"                                                                       
##  [26] ""                                                                                                                                    
##  [27] ""                                                                                                                                    
##  [28] ""                                                                                                                                    
##  [29] "\"Strategic Loan Modification: An Options-Based Response to Strategic Default,\""                                                    
##  [30] ""                                                                                                                                    
##  [31] ""                                                                                                                                    
##  [32] "\"Options and Structured Products in Behavioral Portfolios,\" (with Meir Statman), (2013), "                                         
##  [33] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                      
##  [34] ""                                                                                                                                    
##  [35] ""                                                                                                                                    
##  [36] ""                                                                                                                                    
##  [37] "\"Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance,\" (2011), (with Hoje Jo and Yongtae Kim), "        
##  [38] ""                                                                                                                                    
##  [39] "Optimization with Mental Accounts,\" (2010), (with Harry Markowitz, Jonathan"                                                        
##  [40] ""                                                                                                                                    
##  [41] ""                                                                                                                                    
##  [42] ""                                                                                                                                    
##  [43] ""                                                                                                                                    
##  [44] ""                                                                                                                                    
##  [45] ""                                                                                                                                    
##  [46] ""                                                                                                                                    
##  [47] ""                                                                                                                                    
##  [48] "\"Accounting-based versus market-based cross-sectional models of CDS spreads,\" "                                                    
##  [49] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                      
##  [50] ""                                                                                                                                    
##  [51] ""                                                                                                                                    
##  [52] "\"Hedging Credit: Equity Liquidity Matters,\" (with Paul Hanouna), (2009),"                                                          
##  [53] ""                                                                                                                                    
##  [54] "\"An Integrated Model for Hybrid Securities,\""                                                                                      
##  [55] ""                                                                                                                                    
##  [56] "\"Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,\""                                                              
##  [57] ""                                                                                                                                    
##  [58] "\"Common Failings: How Corporate Defaults are Correlated\" "                                                                         
##  [59] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                            
##  [60] ""                                                                                                                                    
##  [61] "\"A Clinical Study of Investor Discussion and Sentiment,\" "                                                                         
##  [62] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                                 
##  [63] ""                                                                                                                                    
##  [64] "\"International Portfolio Choice with Systemic Risk,\""                                                                              
##  [65] "The loss resulting from diminished diversification is small, while"                                                                  
##  [66] ""                                                                                                                                    
##  [67] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                                 
##  [68] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                            
##  [69] ""                                                                                                                                    
##  [70] "\"A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                            
##  [71] "with Rating Transitions,\" (with Viral Acharya and Rangarajan Sundaram),"                                                            
##  [72] ""                                                                                                                                    
##  [73] ""                                                                                                                                    
##  [74] "\"Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                     
##  [75] ""                                                                                                                                    
##  [76] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                      
##  [77] ""                                                                                                                                    
##  [78] "\"The Psychology of Financial Decision Making: A Case"                                                                               
##  [79] "for Theory-Driven Experimental Enquiry,''"                                                                                           
##  [80] "1999, (with Priya Raghubir),"                                                                                                        
##  [81] ""                                                                                                                                    
##  [82] "\"Of Smiles and Smirks: A Term Structure Perspective,''"                                                                             
##  [83] ""                                                                                                                                    
##  [84] "\"A Theory of Banking Structure,\" 1999, (with Ashish Nanda),"                                                                       
##  [85] "by function based upon two dimensions: the degree of information asymmetry "                                                         
##  [86] ""                                                                                                                                    
##  [87] "\"A Theory of Optimal Timing and Selectivity,'' "                                                                                    
##  [88] ""                                                                                                                                    
##  [89] "\"A Direct Discrete-Time Approach to"                                                                                                
##  [90] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                    
##  [91] ""                                                                                                                                    
##  [92] "\"The Central Tendency: A Second Factor in"                                                                                          
##  [93] "Bond Yields,\" 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                               
##  [94] ""                                                                                                                                    
##  [95] "\"Efficiency with Costly Information: A Reinterpretation of"                                                                         
##  [96] "Evidence from Managed Portfolios,\" (with Edwin Elton, Martin Gruber and Matt "                                                      
##  [97] "Presented and Reprinted in the Proceedings of The "                                                                                  
##  [98] "Seminar on the Analysis of Security Prices at the Center "                                                                           
##  [99] "for Research in Security   Prices  at the University of "                                                                            
## [100] ""                                                                                                                                    
## [101] ""                                                                                                                                    
## [102] ""                                                                                                                                    
## [103] ""                                                                                                                                    
## [104] ""                                                                                                                                    
## [105] ""                                                                                                                                    
## [106] ""                                                                                                                                    
## [107] "\"Managing Rollover Risk with Capital Structure Covenants"                                                                           
## [108] "in Structured Finance Vehicles\" (2016),"                                                                                            
## [109] ""                                                                                                                                    
## [110] ""                                                                                                                                    
## [111] ""                                                                                                                                    
## [112] "\"The Design and Risk Management of Structured Finance Vehicles\" (2016),"                                                           
## [113] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "                                                    
## [114] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "                                        
## [115] ""                                                                                                                                    
## [116] ""                                                                                                                                    
## [117] "\"Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework\" (2014), "                                                
## [118] "(with Seoyoung Kim and Meir Statman),  "                                                                                             
## [119] ""                                                                                                                                    
## [120] ""                                                                                                                                    
## [121] "\"Going for Broke: Restructuring Distressed Debt Portfolios\" (2014),"                                                               
## [122] ""                                                                                                                                    
## [123] ""                                                                                                                                    
## [124] "\"Digital Portfolios.\" (2013), "                                                                                                    
## [125] ""                                                                                                                                    
## [126] ""                                                                                                                                    
## [127] "\"Options on Portfolios with Higher-Order Moments,\" (2009),"                                                                        
## [128] "options on a multivariate system of assets, calibrated to the return "                                                               
## [129] ""                                                                                                                                    
## [130] ""                                                                                                                                    
## [131] "\"Dealing with Dimension: Option Pricing on Factor Trees,\" (2009),"                                                                 
## [132] "you to price options on multiple assets in a unified fraamework. Computational"                                                      
## [133] ""                                                                                                                                    
## [134] ""                                                                                                                                    
## [135] "\"Modeling"                                                                                                                          
## [136] "Correlated Default with a Forest of Binomial Trees,\" (2007), (with"                                                                 
## [137] ""                                                                                                                                    
## [138] "\"Basel II: Correlation Related Issues\" (2007), "                                                                                   
## [139] ""                                                                                                                                    
## [140] "\"Correlated Default Risk,\" (2006),"                                                                                                
## [141] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                               
## [142] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                                   
## [143] ""                                                                                                                                    
## [144] "\"A Simple Model for Pricing Equity Options with Markov"                                                                             
## [145] "Switching State Variables\" (2006),"                                                                                                 
## [146] "(with Donald Aingworth and Rajeev Motwani),"                                                                                         
## [147] ""                                                                                                                                    
## [148] "\"The Firm's Management of Social Interactions,\" (2005)"                                                                            
## [149] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                        
## [150] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "                                                                           
## [151] ""                                                                                                                                    
## [152] "\"Financial Communities\" (with Jacob Sisk), 2005, "                                                                                 
## [153] "Summer, 112-123."                                                                                                                    
## [154] ""                                                                                                                                    
## [155] "\"Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                           
## [156] "and Risk Assessment,\"(with Alistair Sinclair), 2005, "                                                                              
## [157] "where incomplete information about the value of an asset may be exploited to "                                                       
## [158] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                      
## [159] ""                                                                                                                                    
## [160] "\"Correlated Default Processes: A Criterion-Based Copula Approach,\""                                                                
## [161] "Special Issue on Default Risk. "                                                                                                     
## [162] ""                                                                                                                                    
## [163] "\"Private Equity Returns: An Empirical Examination of the Exit of"                                                                   
## [164] "Venture-Backed Companies,\" (with Murali Jagannathan and Atulya Sarin),"                                                             
## [165] "firm being financed, the valuation at the time of financing, and the prevailing market"                                              
## [166] "sentiment. Helps understand the risk premium required for the"                                                                       
## [167] ""                                                                                                                                    
## [168] "Issue on Computational Methods in Economics and Finance),  "                                                                         
## [169] "December, 55-69."                                                                                                                    
## [170] ""                                                                                                                                    
## [171] "\"Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                    
## [172] ""                                                                                                                                    
## [173] "\"The Impact of Correlated Default Risk on Credit Portfolios,\""                                                                     
## [174] "(with Gifford Fong, and Gary Geng),"                                                                                                 
## [175] ""                                                                                                                                    
## [176] "\"How Diversified are Internationally Diversified Portfolios:"                                                                       
## [177] "Time-Variation in the Covariances between International Returns,\""                                                                  
## [178] ""                                                                                                                                    
## [179] "\"Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                          
## [180] ""                                                                                                                                    
## [181] "\"Macroeconomic Implications of Search Theory for the Labor Market,\""                                                               
## [182] ""                                                                                                                                    
## [183] "\"Auction Theory: A Summary with Applications and Evidence"                                                                          
## [184] "from the Treasury Markets,\" 1996, (with Rangarajan Sundaram),"                                                                      
## [185] ""                                                                                                                                    
## [186] "\"A Simple Approach to Three Factor Affine Models of the"                                                                            
## [187] "Term Structure,\" (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                          
## [188] ""                                                                                                                                    
## [189] "\"Analytical Approximations of  the Term Structure"                                                                                  
## [190] "for Jump-diffusion Processes: A Numerical Analysis,\" 1996, "                                                                        
## [191] ""                                                                                                                                    
## [192] "Markov Chain Term Structure Models: Extensions and Applications,\""                                                                  
## [193] ""                                                                                                                                    
## [194] ""                                                                                                                                    
## [195] "\"Exact Solutions for Bond and Options Prices"                                                                                       
## [196] "with Systematic Jump Risk,\" 1996, (with Silverio Foresi),"                                                                          
## [197] ""                                                                                                                                    
## [198] "\"Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                                 
## [199] "and Credit Spreads are Stochastic,\" 1996, "                                                                                         
## [200] "v5(2), 161-198."                                                                                                                     
## [201] ""                                                                                                                                    
## [202] ""                                                                                                                                    
## [203] ""                                                                                                                                    
## [204] ""                                                                                                                                    
## [205] ""                                                                                                                                    
## [206] ""                                                                                                                                    
## [207] ""                                                                                                                                    
## [208] "\"Did CDS Trading Improve the Market for Corporate Bonds,\" (2016), "                                                                
## [209] "(with Madhu Kalimipalli and Subhankar Nayak), "                                                                                      
## [210] ""                                                                                                                                    
## [211] "\"Big Data's Big Muscle,\" (2016), "                                                                                                 
## [212] ""                                                                                                                                    
## [213] ""                                                                                                                                    
## [214] "\"Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier,\" (2011), "          
## [215] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                         
## [216] ""                                                                                                                                    
## [217] "\"News Analytics: Framework, Techniques and Metrics,\" The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [218] ""                                                                                                                                    
## [219] ""                                                                                                                                    
## [220] ""                                                                                                                                    
## [221] "\"Random Lattices for Option Pricing Problems in Finance,\" (2011),"                                                                 
## [222] ""                                                                                                                                    
## [223] ""                                                                                                                                    
## [224] "\"Implementing Option Pricing Models using Python and Cython,\" (2010),"                                                             
## [225] ""                                                                                                                                    
## [226] ""                                                                                                                                    
## [227] ""                                                                                                                                    
## [228] "\"The Finance Web: Internet Information and Markets,\" (2010), "                                                                     
## [229] ""                                                                                                                                    
## [230] ""                                                                                                                                    
## [231] "\"Financial Applications with Parallel R,\" (2009), "                                                                                
## [232] ""                                                                                                                                    
## [233] ""                                                                                                                                    
## [234] "\"Recovery Swaps,\" (2009), (with Paul Hanouna),  "                                                                                  
## [235] ""                                                                                                                                    
## [236] "\"Recovery Rates,\" (2009),(with Paul Hanouna), "                                                                                    
## [237] ""                                                                                                                                    
## [238] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                     
## [239] ""                                                                                                                                    
## [240] ""                                                                                                                                    
## [241] "\"Credit Default Swap Spreads\", 2006, (with Paul Hanouna), "                                                                        
## [242] ""                                                                                                                                    
## [243] "\"Multiple-Core Processors for Finance Applications,\" 2006, "                                                                       
## [244] ""                                                                                                                                    
## [245] "\"Power Laws,\" 2005, (with Jacob Sisk), "                                                                                           
## [246] ""                                                                                                                                    
## [247] "\"Genetic Algorithms,\" 2005,"                                                                                                       
## [248] ""                                                                                                                                    
## [249] "\"Recovery Risk,\" 2005,"                                                                                                            
## [250] ""                                                                                                                                    
## [251] "\"Venture Capital Syndication\", (with Hoje Jo and Yongtae Kim), 2004"                                                               
## [252] ""                                                                                                                                    
## [253] "\"Technical Analysis\", (with David Tien), 2004"                                                                                     
## [254] ""                                                                                                                                    
## [255] "\"Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                           
## [256] "Madhu Kalimipalli), 2003,"                                                                                                           
## [257] ""                                                                                                                                    
## [258] "\"Modern Pricing of Interest Rate Derivatives - Book Review\", "                                                                     
## [259] ""                                                                                                                                    
## [260] "\"Contagion\", 2003,"                                                                                                                
## [261] ""                                                                                                                                    
## [262] "\"Hedge Funds\", 2003,"                                                                                                              
## [263] "Reprinted in "                                                                                                                       
## [264] "\"Working Papers on Hedge Funds,\" in The World of Hedge Funds: "                                                                    
## [265] "Characteristics and "                                                                                                                
## [266] "Analysis, 2005, World Scientific."                                                                                                   
## [267] ""                                                                                                                                    
## [268] "\"The Internet and Investors\", 2003,"                                                                                               
## [269] ""                                                                                                                                    
## [270] "  \"Useful things to know about Correlated Default Risk,\""                                                                          
## [271] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                 
## [272] ""                                                                                                                                    
## [273] "\"The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                      
## [274] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                           
## [275] "Courant Institute of Mathematical Sciences, special volume on"                                                                       
## [276] ""                                                                                                                                    
## [277] "\"A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                      
## [278] "(with Rangarajan Sundaram), reprinted in "                                                                                           
## [279] "the Courant Institute of Mathematical Sciences, special volume on"                                                                   
## [280] ""                                                                                                                                    
## [281] "\"Stochastic Mean Models of the Term Structure,''"                                                                                   
## [282] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                                
## [283] "John Wiley & Sons, Inc., 128-161."                                                                                                   
## [284] ""                                                                                                                                    
## [285] "\"Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                          
## [286] "John Wiley & Sons, Inc., 162-189."                                                                                                   
## [287] ""                                                                                                                                    
## [288] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                                   
## [289] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                            
## [290] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                            
## [291] ""                                                                                                                                    
## [292] "  \"Pricing Credit Derivatives,'' "                                                                                                  
## [293] "J. Frost and J.G. Whittaker, 101-138."                                                                                               
## [294] ""                                                                                                                                    
## [295] "\"On the Recursive Implementation of Term Structure Models,'' "                                                                      
## [296] ""                                                                                                                                    
## [297] ""                                                                                                                                    
## [298] ""                                                                                                                                    
## [299] ""                                                                                                                                    
## [300] ""                                                                                                                                    
## [301] ""                                                                                                                                    
## [302] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "                                                                  
## [303] "(with Jeroen Jansen and Frank Fabozzi)."                                                                                             
## [304] ""                                                                                                                                    
## [305] "\"Efficient Rebalancing of Taxable Portfolios\" (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                    
## [306] ""                                                                                                                                    
## [307] ""                                                                                                                                    
## [308] "\"The Fast and the Curious: VC Drift\" "                                                                                             
## [309] "(with Amit Bubna and Paul Hanouna), "                                                                                                
## [310] ""                                                                                                                                    
## [311] ""                                                                                                                                    
## [312] "\"Venture Capital Communities\" (with Amit Bubna and Nagpurnanand Prabhala), "                                                       
## [313] ""                                                                                                                                    
## [314] ""                                                                                                                                    
## [315] ""                                                                                                                                    
## [316] ""                                                                                                                                    
## [317] ""                                                                                                                                    
## [318] ""                                                                                                                                    
## [319] ""                                                                                                                                    
## [320] ""                                                                                                                                    
## [321] ""                                                                                                                                    
## [322] ""                                                                                                                                    
## [323] ""                                                                                                                                    
## [324] ""                                                                                                                                    
## [325] ""                                                                                                                                    
## [326] ""                                                                                                                                    
## [327] ""                                                                                                                                    
## [328] ""                                                                                                                                    
## [329] "                                                "                                                                                    
## [330] ""                                                                                                                                    
## [331] ""                                                                                                                                    
## [332] ""                                                                                                                                    
## [333] ""                                                                                                                                    
## [334] ""                                                                                                                                    
## [335] ""                                                                                                                                    
## [336] ""
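# Strip double-quote characters (str_replace_all() comes from the stringr package)
text = str_replace_all(text,"[\"]","")
# Indices of the lines that are now empty
idx = which(nchar(text)==0)
# Keep every line that is not empty
research = text[setdiff(seq(1,length(text)),idx)]
print(research)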
##   [1] "Data Science: Theories, Models, Algorithms, and Analytics (web book -- work in progress)"                                        
##   [2] "Derivatives: Principles and Practice (2010),"                                                                                    
##   [3] "(Rangarajan Sundaram and Sanjiv Das), McGraw Hill."                                                                              
##   [4] "An Index-Based Measure of Liquidity,'' (with George Chacko and Rong Fan), (2016)."                                               
##   [5] "Matrix Metrics: Network-Based Systemic Risk Scoring, (2016)."                                                                    
##   [6] "of systemic risk. This paper won the First Prize in the MIT-CFP competition 2016 for "                                           
##   [7] "the best paper on SIFIs (systemically important financial institutions). "                                                       
##   [8] "It also won the best paper award at "                                                                                            
##   [9] "Credit Spreads with Dynamic Debt (with Seoyoung Kim), (2015), "                                                                  
##  [10] "Text and Context: Language Analytics for Finance, (2014),"                                                                       
##  [11] "Strategic Loan Modification: An Options-Based Response to Strategic Default,"                                                    
##  [12] "Options and Structured Products in Behavioral Portfolios, (with Meir Statman), (2013), "                                         
##  [13] "and barrier range notes, in the presence of fat-tailed outcomes using copulas."                                                  
##  [14] "Polishing Diamonds in the Rough: The Sources of Syndicated Venture Performance, (2011), (with Hoje Jo and Yongtae Kim), "        
##  [15] "Optimization with Mental Accounts, (2010), (with Harry Markowitz, Jonathan"                                                      
##  [16] "Accounting-based versus market-based cross-sectional models of CDS spreads, "                                                    
##  [17] "(with Paul Hanouna and Atulya Sarin), (2009), "                                                                                  
##  [18] "Hedging Credit: Equity Liquidity Matters, (with Paul Hanouna), (2009),"                                                          
##  [19] "An Integrated Model for Hybrid Securities,"                                                                                      
##  [20] "Yahoo for Amazon! Sentiment Extraction from Small Talk on the Web,"                                                              
##  [21] "Common Failings: How Corporate Defaults are Correlated "                                                                         
##  [22] "(with Darrell Duffie, Nikunj Kapadia and Leandro Saita)."                                                                        
##  [23] "A Clinical Study of Investor Discussion and Sentiment, "                                                                         
##  [24] "(with Asis Martinez-Jerez and Peter Tufano), 2005, "                                                                             
##  [25] "International Portfolio Choice with Systemic Risk,"                                                                              
##  [26] "The loss resulting from diminished diversification is small, while"                                                              
##  [27] "Speech: Signaling, Risk-sharing and the Impact of Fee Structures on"                                                             
##  [28] "investor welfare. Contrary to regulatory intuition, incentive structures"                                                        
##  [29] "A Discrete-Time Approach to No-arbitrage Pricing of Credit derivatives"                                                          
##  [30] "with Rating Transitions, (with Viral Acharya and Rangarajan Sundaram),"                                                          
##  [31] "Pricing Interest Rate Derivatives: A General Approach,''(with George Chacko),"                                                   
##  [32] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                    
##  [33] "The Psychology of Financial Decision Making: A Case"                                                                             
##  [34] "for Theory-Driven Experimental Enquiry,''"                                                                                       
##  [35] "1999, (with Priya Raghubir),"                                                                                                    
##  [36] "Of Smiles and Smirks: A Term Structure Perspective,''"                                                                           
##  [37] "A Theory of Banking Structure, 1999, (with Ashish Nanda),"                                                                       
##  [38] "by function based upon two dimensions: the degree of information asymmetry "                                                     
##  [39] "A Theory of Optimal Timing and Selectivity,'' "                                                                                  
##  [40] "A Direct Discrete-Time Approach to"                                                                                              
##  [41] "Poisson-Gaussian Bond Option Pricing in the Heath-Jarrow-Morton "                                                                
##  [42] "The Central Tendency: A Second Factor in"                                                                                        
##  [43] "Bond Yields, 1998, (with Silverio Foresi and Pierluigi Balduzzi),  "                                                             
##  [44] "Efficiency with Costly Information: A Reinterpretation of"                                                                       
##  [45] "Evidence from Managed Portfolios, (with Edwin Elton, Martin Gruber and Matt "                                                    
##  [46] "Presented and Reprinted in the Proceedings of The "                                                                              
##  [47] "Seminar on the Analysis of Security Prices at the Center "                                                                       
##  [48] "for Research in Security   Prices  at the University of "                                                                        
##  [49] "Managing Rollover Risk with Capital Structure Covenants"                                                                         
##  [50] "in Structured Finance Vehicles (2016),"                                                                                          
##  [51] "The Design and Risk Management of Structured Finance Vehicles (2016),"                                                           
##  [52] "Post the recent subprime financial crisis, we inform the creation of safer SIVs "                                                
##  [53] "in structured finance, and propose avenues of mitigating risks faced by senior debt through "                                    
##  [54] "Coming up Short: Managing Underfunded Portfolios in an LDI-ES Framework (2014), "                                                
##  [55] "(with Seoyoung Kim and Meir Statman),  "                                                                                         
##  [56] "Going for Broke: Restructuring Distressed Debt Portfolios (2014),"                                                               
##  [57] "Digital Portfolios. (2013), "                                                                                                    
##  [58] "Options on Portfolios with Higher-Order Moments, (2009),"                                                                        
##  [59] "options on a multivariate system of assets, calibrated to the return "                                                           
##  [60] "Dealing with Dimension: Option Pricing on Factor Trees, (2009),"                                                                 
##  [61] "you to price options on multiple assets in a unified fraamework. Computational"                                                  
##  [62] "Modeling"                                                                                                                        
##  [63] "Correlated Default with a Forest of Binomial Trees, (2007), (with"                                                               
##  [64] "Basel II: Correlation Related Issues (2007), "                                                                                   
##  [65] "Correlated Default Risk, (2006),"                                                                                                
##  [66] "(with Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                                           
##  [67] "increase as markets worsen. Regime switching models are needed to explain dynamic"                                               
##  [68] "A Simple Model for Pricing Equity Options with Markov"                                                                           
##  [69] "Switching State Variables (2006),"                                                                                               
##  [70] "(with Donald Aingworth and Rajeev Motwani),"                                                                                     
##  [71] "The Firm's Management of Social Interactions, (2005)"                                                                            
##  [72] "(with D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, "                                                                    
##  [73] "B. Pfeieffer, B. Libai, S. Sen, M. Shi, and P. Verlegh). "                                                                       
##  [74] "Financial Communities (with Jacob Sisk), 2005, "                                                                                 
##  [75] "Summer, 112-123."                                                                                                                
##  [76] "Monte Carlo Markov Chain Methods for Derivative Pricing"                                                                         
##  [77] "and Risk Assessment,(with Alistair Sinclair), 2005, "                                                                            
##  [78] "where incomplete information about the value of an asset may be exploited to "                                                   
##  [79] "undertake fast and accurate pricing. Proof that a fully polynomial randomized "                                                  
##  [80] "Correlated Default Processes: A Criterion-Based Copula Approach,"                                                                
##  [81] "Special Issue on Default Risk. "                                                                                                 
##  [82] "Private Equity Returns: An Empirical Examination of the Exit of"                                                                 
##  [83] "Venture-Backed Companies, (with Murali Jagannathan and Atulya Sarin),"                                                           
##  [84] "firm being financed, the valuation at the time of financing, and the prevailing market"                                          
##  [85] "sentiment. Helps understand the risk premium required for the"                                                                   
##  [86] "Issue on Computational Methods in Economics and Finance),  "                                                                     
##  [87] "December, 55-69."                                                                                                                
##  [88] "Bayesian Migration in Credit Ratings Based on Probabilities of"                                                                  
##  [89] "The Impact of Correlated Default Risk on Credit Portfolios,"                                                                     
##  [90] "(with Gifford Fong, and Gary Geng),"                                                                                             
##  [91] "How Diversified are Internationally Diversified Portfolios:"                                                                     
##  [92] "Time-Variation in the Covariances between International Returns,"                                                                
##  [93] "Discrete-Time Bond and Option Pricing for Jump-Diffusion"                                                                        
##  [94] "Macroeconomic Implications of Search Theory for the Labor Market,"                                                               
##  [95] "Auction Theory: A Summary with Applications and Evidence"                                                                        
##  [96] "from the Treasury Markets, 1996, (with Rangarajan Sundaram),"                                                                    
##  [97] "A Simple Approach to Three Factor Affine Models of the"                                                                          
##  [98] "Term Structure, (with Pierluigi Balduzzi, Silverio Foresi and Rangarajan"                                                        
##  [99] "Analytical Approximations of  the Term Structure"                                                                                
## [100] "for Jump-diffusion Processes: A Numerical Analysis, 1996, "                                                                      
## [101] "Markov Chain Term Structure Models: Extensions and Applications,"                                                                
## [102] "Exact Solutions for Bond and Options Prices"                                                                                     
## [103] "with Systematic Jump Risk, 1996, (with Silverio Foresi),"                                                                        
## [104] "Pricing Credit Sensitive Debt when Interest Rates, Credit Ratings"                                                               
## [105] "and Credit Spreads are Stochastic, 1996, "                                                                                       
## [106] "v5(2), 161-198."                                                                                                                 
## [107] "Did CDS Trading Improve the Market for Corporate Bonds, (2016), "                                                                
## [108] "(with Madhu Kalimipalli and Subhankar Nayak), "                                                                                  
## [109] "Big Data's Big Muscle, (2016), "                                                                                                 
## [110] "Portfolios for Investors Who Want to Reach Their Goals While Staying on the Mean-Variance Efficient Frontier, (2011), "          
## [111] "(with Harry Markowitz, Jonathan Scheid, and Meir Statman), "                                                                     
## [112] "News Analytics: Framework, Techniques and Metrics, The Handbook of News Analytics in Finance, May 2011, John Wiley & Sons, U.K. "
## [113] "Random Lattices for Option Pricing Problems in Finance, (2011),"                                                                 
## [114] "Implementing Option Pricing Models using Python and Cython, (2010),"                                                             
## [115] "The Finance Web: Internet Information and Markets, (2010), "                                                                     
## [116] "Financial Applications with Parallel R, (2009), "                                                                                
## [117] "Recovery Swaps, (2009), (with Paul Hanouna),  "                                                                                  
## [118] "Recovery Rates, (2009),(with Paul Hanouna), "                                                                                    
## [119] "``A Simple Model for Pricing Securities with a Debt-Equity Linkage,'' 2008, in "                                                 
## [120] "Credit Default Swap Spreads, 2006, (with Paul Hanouna), "                                                                        
## [121] "Multiple-Core Processors for Finance Applications, 2006, "                                                                       
## [122] "Power Laws, 2005, (with Jacob Sisk), "                                                                                           
## [123] "Genetic Algorithms, 2005,"                                                                                                       
## [124] "Recovery Risk, 2005,"                                                                                                            
## [125] "Venture Capital Syndication, (with Hoje Jo and Yongtae Kim), 2004"                                                               
## [126] "Technical Analysis, (with David Tien), 2004"                                                                                     
## [127] "Liquidity and the Bond Markets, (with Jan Ericsson and "                                                                         
## [128] "Madhu Kalimipalli), 2003,"                                                                                                       
## [129] "Modern Pricing of Interest Rate Derivatives - Book Review, "                                                                     
## [130] "Contagion, 2003,"                                                                                                                
## [131] "Hedge Funds, 2003,"                                                                                                              
## [132] "Reprinted in "                                                                                                                   
## [133] "Working Papers on Hedge Funds, in The World of Hedge Funds: "                                                                    
## [134] "Characteristics and "                                                                                                            
## [135] "Analysis, 2005, World Scientific."                                                                                               
## [136] "The Internet and Investors, 2003,"                                                                                               
## [137] "  Useful things to know about Correlated Default Risk,"                                                                          
## [138] "(with Gifford Fong, Laurence Freed, Gary Geng, and Nikunj Kapadia),"                                                             
## [139] "The Regulation of Fee Structures in Mutual Funds: A Theoretical Analysis,'' "                                                    
## [140] "(with Rangarajan Sundaram), 1998, NBER WP No 6639, in the"                                                                       
## [141] "Courant Institute of Mathematical Sciences, special volume on"                                                                   
## [142] "A Discrete-Time Approach to Arbitrage-Free Pricing of Credit Derivatives,'' "                                                    
## [143] "(with Rangarajan Sundaram), reprinted in "                                                                                       
## [144] "the Courant Institute of Mathematical Sciences, special volume on"                                                               
## [145] "Stochastic Mean Models of the Term Structure,''"                                                                                 
## [146] "(with Pierluigi Balduzzi, Silverio Foresi and Rangarajan Sundaram), "                                                            
## [147] "John Wiley & Sons, Inc., 128-161."                                                                                               
## [148] "Interest Rate Modeling with Jump-Diffusion Processes,'' "                                                                        
## [149] "John Wiley & Sons, Inc., 162-189."                                                                                               
## [150] "Comments on 'Pricing Excess-of-Loss Reinsurance Contracts against"                                                               
## [151] "Catastrophic Loss,' by J. David Cummins, C. Lewis, and Richard Phillips,"                                                        
## [152] "Froot (Ed.), University of Chicago Press, 1999, 141-145."                                                                        
## [153] "  Pricing Credit Derivatives,'' "                                                                                                
## [154] "J. Frost and J.G. Whittaker, 101-138."                                                                                           
## [155] "On the Recursive Implementation of Term Structure Models,'' "                                                                    
## [156] "”Local Volatility and the Recovery Rate of Credit Default Swaps”, "                                                              
## [157] "(with Jeroen Jansen and Frank Fabozzi)."                                                                                         
## [158] "Efficient Rebalancing of Taxable Portfolios (with Dan Ostrov, Dennis Ding, Vincent Newell), "                                    
## [159] "The Fast and the Curious: VC Drift "                                                                                             
## [160] "(with Amit Bubna and Paul Hanouna), "                                                                                            
## [161] "Venture Capital Communities (with Amit Bubna and Nagpurnanand Prabhala), "                                                       
## [162] "                                                "

Take a look at the text now to see how much cleaner it is. There is a better way, however: use the text-mining package tm.

Text Mining with the tm Package

  1. The R programming language supports a text-mining package, succinctly named tm. Using functions such as readDOC() and readPDF() for reading DOC and PDF files, the package makes accessing various file formats easy.

  2. Text mining involves applying functions to many text documents. A library of text documents (irrespective of format) is called a corpus. The essential and highly useful feature of text mining packages is the ability to operate on the entire set of documents at one go.

library(tm)
## Loading required package: NLP
text = c("INTL is expected to announce good earnings report", "AAPL first quarter disappoints","GOOG announces new wallet", "YHOO ascends from old ways")
text_corpus = Corpus(VectorSource(text))
print(text_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4
writeCorpus(text_corpus)

The writeCorpus() function in tm writes each document to a separate text file on the hard drive; by default the files are named 1.txt, 2.txt, etc. The simple program code above shows how text scraped off a web page, and collapsed into a single character string for each document, may then be converted into a corpus of documents using the Corpus() function.

It is easy to inspect the corpus as follows:

inspect(text_corpus)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 4
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 49
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 30
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 25
## 
## [[4]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 26

A second example

Here we use lapply to inspect the contents of the corpus.

#USING THE tm PACKAGE
library(tm)
text = c("Doc1;","This is doc2 --", "And, then Doc3.")
ctext = Corpus(VectorSource(text))
ctext
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
#writeCorpus(ctext)

#THE CORPUS IS A LIST OBJECT in R of type VCorpus or Corpus
inspect(ctext)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 5
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15
print(as.character(ctext[[1]]))
## [1] "Doc1;"
print(lapply(ctext[1:2],as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This is doc2 --"
ctext = tm_map(ctext,tolower)  #Lower case all text in all docs
inspect(ctext)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] doc1;
## 
## [[2]]
## [1] this is doc2 --
## 
## [[3]]
## [1] and, then doc3.
ctext2 = tm_map(ctext,toupper)
inspect(ctext2)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] DOC1;
## 
## [[2]]
## [1] THIS IS DOC2 --
## 
## [[3]]
## [1] AND, THEN DOC3.

Function tm_map

#ctext2 IS IN UPPER CASE, SO THE WORDS TO DROP MUST BE IN UPPER CASE TOO
dropWords = c("IS","AND","THEN")
ctext2 = tm_map(ctext2,removeWords,dropWords)
inspect(ctext2)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## [1] DOC1;
## 
## [[2]]
## [1] THIS  DOC2 --
## 
## [[3]]
## [1] ,  DOC3.
ctext = Corpus(VectorSource(text))
temp = ctext
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This is doc2 --"
## 
## $`3`
## [1] "And, then Doc3."
temp = tm_map(temp,removeWords,stopwords("english"))
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1;"
## 
## $`2`
## [1] "This  doc2 --"
## 
## $`3`
## [1] "And,  Doc3."
temp = tm_map(temp,removePunctuation)
print(lapply(temp,as.character))
## $`1`
## [1] "Doc1"
## 
## $`2`
## [1] "This  doc2 "
## 
## $`3`
## [1] "And  Doc3"
temp = tm_map(temp,removeNumbers)
print(lapply(temp,as.character))
## $`1`
## [1] "Doc"
## 
## $`2`
## [1] "This  doc "
## 
## $`3`
## [1] "And  Doc"
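
To see what such transformations do under the hood, the cleaning steps above can be approximated in base R with regular expressions. This is only a minimal sketch and not how tm implements them; the helper name clean_text and its short drop_words list are assumptions made for this illustration.

```r
# Minimal base-R sketch of a cleaning pipeline (hypothetical helper, not part of tm)
clean_text = function(x, drop_words = c("is", "and", "then")) {
  x = tolower(x)                                   # lower case everything
  pattern = paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x = gsub(pattern, "", x, perl = TRUE)            # drop whole words only
  x = gsub("[[:punct:]]", "", x)                   # remove punctuation
  x = gsub("[[:digit:]]", "", x)                   # remove numbers
  x = gsub("\\s+", " ", x)                         # collapse repeated white space
  trimws(x)                                        # strip leading/trailing spaces
}

docs = c("Doc1;", "This is doc2 --", "And, then Doc3.")
cleaned = clean_text(docs)
print(cleaned)
```

Unlike the tm output above, this sketch also collapses the extra white space left behind when words are removed.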

Bag of Words

We can create a bag of words by collapsing all the text into one bundle.

#CONVERT CORPUS INTO ARRAY OF STRINGS AND FLATTEN
txt = NULL
for (j in 1:length(temp)) {
  txt = c(txt,temp[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
## [1] "doc this  doc  and  doc"
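
Once the text is in a single bag, word frequencies follow directly. The sketch below, in base R only, splits the bag printed above on white space and tabulates the words; the variable names are chosen for this illustration.

```r
# Count word frequencies in the bag of words (base R only)
txt = "doc this  doc  and  doc"
words = strsplit(txt, "\\s+")[[1]]        # split on runs of white space
words = words[words != ""]                # guard against empty tokens
counts = sort(table(words), decreasing = TRUE)
print(counts)
```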

Example (on my bio page)

Now we will run the full process on my bio page.

text = readLines("http://srdas.github.io/bio-candid.html")
ctext = Corpus(VectorSource(text))
ctext
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 79
print(lapply(ctext, as.character))
## $`1`
## [1] "<HTML>"
## 
## $`2`
## [1] "<BODY background=\"http://algo.scu.edu/~sanjivdas/graphics/back2.gif\">"
## 
## $`3`
## [1] ""
## 
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
## 
## $`5`
## [1] "Santa Clara University's Leavey School of Business. He previously held"
## 
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
## 
## $`7`
## [1] "and UC Berkeley. He holds post-graduate degrees in Finance (M.Phil and"
## 
## $`8`
## [1] "Ph.D. from New York University), Computer Science (M.S. from UC"
## 
## $`9`
## [1] "Berkeley), an MBA from the Indian Institute of Management, Ahmedabad,"
## 
## $`10`
## [1] "B.Com in Accounting and Economics (University of Bombay, Sydenham"
## 
## $`11`
## [1] "College), and is also a qualified Cost and Works Accountant. He is a"
## 
## $`12`
## [1] "senior editor of The Journal of Investment Management, co-editor of"
## 
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
## 
## $`14`
## [1] "Research, and Associate Editor of other academic journals. Prior to"
## 
## $`15`
## [1] "being an academic, he worked in the derivatives business in the"
## 
## $`16`
## [1] "Asia-Pacific region as a Vice-President at Citibank. His current"
## 
## $`17`
## [1] "research interests include: the modeling of default risk, machine"
## 
## $`18`
## [1] "learning, social networks, derivatives pricing models, portfolio"
## 
## $`19`
## [1] "theory, and venture capital. He has published over ninety articles in"
## 
## $`20`
## [1] "academic journals, and has won numerous awards for research and"
## 
## $`21`
## [1] "teaching. His recent book \"Derivatives: Principles and Practice\" was"
## 
## $`22`
## [1] "published in May 2010.  He currently also serves as a Senior Fellow at"
## 
## $`23`
## [1] "the FDIC Center for Financial Research."
## 
## $`24`
## [1] ""
## 
## $`25`
## [1] ""
## 
## $`26`
## [1] "<p> <B>Sanjiv Das: A Short Academic Life History</B> <p>"
## 
## $`27`
## [1] ""
## 
## $`28`
## [1] "After loafing and working in many parts of Asia, but never really"
## 
## $`29`
## [1] "growing up, Sanjiv moved to New York to change the world, hopefully"
## 
## $`30`
## [1] "through research.  He graduated in 1994 with a Ph.D. from NYU, and"
## 
## $`31`
## [1] "since then spent five years in Boston, and now lives in San Jose,"
## 
## $`32`
## [1] "California.  Sanjiv loves animals, places in the world where the"
## 
## $`33`
## [1] "mountains meet the sea, riding sport motorbikes, reading, gadgets,"
## 
## $`34`
## [1] "science fiction movies, and writing cool software code. When there is"
## 
## $`35`
## [1] "time available from the excitement of daily life, Sanjiv writes"
## 
## $`36`
## [1] "academic papers, which helps him relax. Always the contrarian, Sanjiv"
## 
## $`37`
## [1] "thinks that New York City is the most calming place in the world,"
## 
## $`38`
## [1] "after California of course."
## 
## $`39`
## [1] ""
## 
## $`40`
## [1] "<p>"
## 
## $`41`
## [1] ""
## 
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University. He came"
## 
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley. In"
## 
## $`44`
## [1] "his past life in the unreal world, Sanjiv worked at Citibank, N.A. in"
## 
## $`45`
## [1] "the Asia-Pacific region. He takes great pleasure in merging his many"
## 
## $`46`
## [1] "previous lives into his current existence, which is incredibly confused"
## 
## $`47`
## [1] "and diverse."
## 
## $`48`
## [1] ""
## 
## $`49`
## [1] "<p>"
## 
## $`50`
## [1] ""
## 
## $`51`
## [1] "Sanjiv's research style is instilled with a distinct \"New York state of"
## 
## $`52`
## [1] "mind\" - it is chaotic, diverse, with minimal method to the madness. He"
## 
## $`53`
## [1] "has published articles on derivatives, term-structure models, mutual"
## 
## $`54`
## [1] "funds, the internet, portfolio choice, banking models, credit risk, and"
## 
## $`55`
## [1] "has unpublished articles in many other areas. Some years ago, he took"
## 
## $`56`
## [1] "time off to get another degree in computer science at Berkeley,"
## 
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession."
## 
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms,"
## 
## $`59`
## [1] "skills he now applies earnestly to his editorial work, and other"
## 
## $`60`
## [1] "pursuits, many of which stem from being in the epicenter of Silicon"
## 
## $`61`
## [1] "Valley."
## 
## $`62`
## [1] ""
## 
## $`63`
## [1] "<p>"
## 
## $`64`
## [1] ""
## 
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv, who needs to live near the"
## 
## $`66`
## [1] "ocean.  The many walks in Greenwich village convinced him that there is"
## 
## $`67`
## [1] "no such thing as a representative investor, yet added many unique"
## 
## $`68`
## [1] "features to his personal utility function. He learnt that it is"
## 
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
## 
## $`70`
## [1] "in. Academia is a real challenge, given that he has to reconcile many"
## 
## $`71`
## [1] "more opinions than ideas. He has been known to have turned down many"
## 
## $`72`
## [1] "offers from Mad magazine to publish his academic work. As he often"
## 
## $`73`
## [1] "explains, you never really finish your education - \"you can check out"
## 
## $`74`
## [1] "any time you like, but you can never leave.\" Which is why he is doomed"
## 
## $`75`
## [1] "to a lifetime in Hotel California. And he believes that, if this is as"
## 
## $`76`
## [1] "bad as it gets, life is really pretty good."
## 
## $`77`
## [1] ""
## 
## $`78`
## [1] ""
## 
## $`79`
## [1] ""
ctext = tm_map(ctext,removePunctuation)
print(lapply(ctext, as.character))
## $`1`
## [1] "HTML"
## 
## $`2`
## [1] "BODY backgroundhttpalgoscuedusanjivdasgraphicsback2gif"
## 
## $`3`
## [1] ""
## 
## $`4`
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at"
## 
## $`5`
## [1] "Santa Clara Universitys Leavey School of Business He previously held"
## 
## $`6`
## [1] "faculty appointments as Associate Professor at Harvard Business School"
## 
## $`7`
## [1] "and UC Berkeley He holds postgraduate degrees in Finance MPhil and"
## 
## $`8`
## [1] "PhD from New York University Computer Science MS from UC"
## 
## $`9`
## [1] "Berkeley an MBA from the Indian Institute of Management Ahmedabad"
## 
## $`10`
## [1] "BCom in Accounting and Economics University of Bombay Sydenham"
## 
## $`11`
## [1] "College and is also a qualified Cost and Works Accountant He is a"
## 
## $`12`
## [1] "senior editor of The Journal of Investment Management coeditor of"
## 
## $`13`
## [1] "The Journal of Derivatives and The Journal of Financial Services"
## 
## $`14`
## [1] "Research and Associate Editor of other academic journals Prior to"
## 
## $`15`
## [1] "being an academic he worked in the derivatives business in the"
## 
## $`16`
## [1] "AsiaPacific region as a VicePresident at Citibank His current"
## 
## $`17`
## [1] "research interests include the modeling of default risk machine"
## 
## $`18`
## [1] "learning social networks derivatives pricing models portfolio"
## 
## $`19`
## [1] "theory and venture capital He has published over ninety articles in"
## 
## $`20`
## [1] "academic journals and has won numerous awards for research and"
## 
## $`21`
## [1] "teaching His recent book Derivatives Principles and Practice was"
## 
## $`22`
## [1] "published in May 2010  He currently also serves as a Senior Fellow at"
## 
## $`23`
## [1] "the FDIC Center for Financial Research"
## 
## $`24`
## [1] ""
## 
## $`25`
## [1] ""
## 
## $`26`
## [1] "p BSanjiv Das A Short Academic Life HistoryB p"
## 
## $`27`
## [1] ""
## 
## $`28`
## [1] "After loafing and working in many parts of Asia but never really"
## 
## $`29`
## [1] "growing up Sanjiv moved to New York to change the world hopefully"
## 
## $`30`
## [1] "through research  He graduated in 1994 with a PhD from NYU and"
## 
## $`31`
## [1] "since then spent five years in Boston and now lives in San Jose"
## 
## $`32`
## [1] "California  Sanjiv loves animals places in the world where the"
## 
## $`33`
## [1] "mountains meet the sea riding sport motorbikes reading gadgets"
## 
## $`34`
## [1] "science fiction movies and writing cool software code When there is"
## 
## $`35`
## [1] "time available from the excitement of daily life Sanjiv writes"
## 
## $`36`
## [1] "academic papers which helps him relax Always the contrarian Sanjiv"
## 
## $`37`
## [1] "thinks that New York City is the most calming place in the world"
## 
## $`38`
## [1] "after California of course"
## 
## $`39`
## [1] ""
## 
## $`40`
## [1] "p"
## 
## $`41`
## [1] ""
## 
## $`42`
## [1] "Sanjiv is now a Professor of Finance at Santa Clara University He came"
## 
## $`43`
## [1] "to SCU from Harvard Business School and spent a year at UC Berkeley In"
## 
## $`44`
## [1] "his past life in the unreal world Sanjiv worked at Citibank NA in"
## 
## $`45`
## [1] "the AsiaPacific region He takes great pleasure in merging his many"
## 
## $`46`
## [1] "previous lives into his current existence which is incredibly confused"
## 
## $`47`
## [1] "and diverse"
## 
## $`48`
## [1] ""
## 
## $`49`
## [1] "p"
## 
## $`50`
## [1] ""
## 
## $`51`
## [1] "Sanjivs research style is instilled with a distinct New York state of"
## 
## $`52`
## [1] "mind  it is chaotic diverse with minimal method to the madness He"
## 
## $`53`
## [1] "has published articles on derivatives termstructure models mutual"
## 
## $`54`
## [1] "funds the internet portfolio choice banking models credit risk and"
## 
## $`55`
## [1] "has unpublished articles in many other areas Some years ago he took"
## 
## $`56`
## [1] "time off to get another degree in computer science at Berkeley"
## 
## $`57`
## [1] "confirming that an unchecked hobby can quickly become an obsession"
## 
## $`58`
## [1] "There he learnt about the fascinating field of Randomized Algorithms"
## 
## $`59`
## [1] "skills he now applies earnestly to his editorial work and other"
## 
## $`60`
## [1] "pursuits many of which stem from being in the epicenter of Silicon"
## 
## $`61`
## [1] "Valley"
## 
## $`62`
## [1] ""
## 
## $`63`
## [1] "p"
## 
## $`64`
## [1] ""
## 
## $`65`
## [1] "Coastal living did a lot to mold Sanjiv who needs to live near the"
## 
## $`66`
## [1] "ocean  The many walks in Greenwich village convinced him that there is"
## 
## $`67`
## [1] "no such thing as a representative investor yet added many unique"
## 
## $`68`
## [1] "features to his personal utility function He learnt that it is"
## 
## $`69`
## [1] "important to open the academic door to the ivory tower and let the world"
## 
## $`70`
## [1] "in Academia is a real challenge given that he has to reconcile many"
## 
## $`71`
## [1] "more opinions than ideas He has been known to have turned down many"
## 
## $`72`
## [1] "offers from Mad magazine to publish his academic work As he often"
## 
## $`73`
## [1] "explains you never really finish your education  you can check out"
## 
## $`74`
## [1] "any time you like but you can never leave Which is why he is doomed"
## 
## $`75`
## [1] "to a lifetime in Hotel California And he believes that if this is as"
## 
## $`76`
## [1] "bad as it gets life is really pretty good"
## 
## $`77`
## [1] ""
## 
## $`78`
## [1] ""
## 
## $`79`
## [1] ""
txt = NULL
for (j in 1:length(ctext)) {
  txt = c(txt,ctext[[j]]$content)
}
txt = paste(txt,collapse=" ")
txt = tolower(txt)
print(txt)
## [1] "html body backgroundhttpalgoscuedusanjivdasgraphicsback2gif  sanjiv das is the william and janice terry professor of finance at santa clara universitys leavey school of business he previously held faculty appointments as associate professor at harvard business school and uc berkeley he holds postgraduate degrees in finance mphil and phd from new york university computer science ms from uc berkeley an mba from the indian institute of management ahmedabad bcom in accounting and economics university of bombay sydenham college and is also a qualified cost and works accountant he is a senior editor of the journal of investment management coeditor of the journal of derivatives and the journal of financial services research and associate editor of other academic journals prior to being an academic he worked in the derivatives business in the asiapacific region as a vicepresident at citibank his current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital he has published over ninety articles in academic journals and has won numerous awards for research and teaching his recent book derivatives principles and practice was published in may 2010  he currently also serves as a senior fellow at the fdic center for financial research   p bsanjiv das a short academic life historyb p  after loafing and working in many parts of asia but never really growing up sanjiv moved to new york to change the world hopefully through research  he graduated in 1994 with a phd from nyu and since then spent five years in boston and now lives in san jose california  sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code when there is time available from the excitement of daily life sanjiv writes academic papers which helps him relax always the contrarian sanjiv thinks that new 
york city is the most calming place in the world after california of course  p  sanjiv is now a professor of finance at santa clara university he came to scu from harvard business school and spent a year at uc berkeley in his past life in the unreal world sanjiv worked at citibank na in the asiapacific region he takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse  p  sanjivs research style is instilled with a distinct new york state of mind  it is chaotic diverse with minimal method to the madness he has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas some years ago he took time off to get another degree in computer science at berkeley confirming that an unchecked hobby can quickly become an obsession there he learnt about the fascinating field of randomized algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of silicon valley  p  coastal living did a lot to mold sanjiv who needs to live near the ocean  the many walks in greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function he learnt that it is important to open the academic door to the ivory tower and let the world in academia is a real challenge given that he has to reconcile many more opinions than ideas he has been known to have turned down many offers from mad magazine to publish his academic work as he often explains you never really finish your education  you can check out any time you like but you can never leave which is why he is doomed to a lifetime in hotel california and he believes that if this is as bad as it gets life is really pretty good   "

Term Document Matrix (TDM)

An extremely important object in text analysis is the Term-Document Matrix. This allows us to store an entire library of text inside a single matrix. This may then be used for analysis as well as searching documents. It forms the basis of search engines, topic analysis, and classification (spam filtering).

It is a table that provides the frequency count of every word (term) in each document. The number of rows in the TDM is equal to the number of unique terms, and the number of columns is equal to the number of documents.
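
To make the layout concrete, here is a minimal base-R sketch that builds a small term-document matrix by hand, with one row per unique term and one column per document. The three short documents and the name tdm_small are made up for this illustration; in practice the TermDocumentMatrix() function in tm does this work.

```r
# Hand-built term-document matrix for three toy documents (base R only)
docs = c("free software comes with no warranty",
         "you may redistribute free software",
         "natural language support for software")
tokens = strsplit(tolower(docs), "\\s+")          # one vector of words per document
terms = sort(unique(unlist(tokens)))              # the vocabulary, i.e., the rows
tdm_small = vapply(tokens,
                   function(tok) as.integer(table(factor(tok, levels = terms))),
                   integer(length(terms)))        # count each term in each document
dimnames(tdm_small) = list(Terms = terms, Docs = as.character(seq_along(docs)))
print(dim(tdm_small))                             # terms x documents
print(tdm_small["software", ])
```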

#TERM-DOCUMENT MATRIX
tdm = TermDocumentMatrix(ctext,control=list(minWordLength=1))
print(tdm)
## <<TermDocumentMatrix (terms: 317, documents: 79)>>
## Non-/sparse entries: 497/24546
## Sparsity           : 98%
## Maximal term length: 49
## Weighting          : term frequency (tf)
inspect(tdm[10:20,11:18])
## <<TermDocumentMatrix (terms: 11, documents: 8)>>
## Non-/sparse entries: 4/84
## Sparsity           : 95%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## 
##               Docs
## Terms          11 12 13 14 15 16 17 18
##   ago           0  0  0  0  0  0  0  0
##   ahmedabad     0  0  0  0  0  0  0  0
##   algorithms    0  0  0  0  0  0  0  0
##   also          1  0  0  0  0  0  0  0
##   always        0  0  0  0  0  0  0  0
##   and           2  0  1  1  0  0  0  0
##   animals       0  0  0  0  0  0  0  0
##   another       0  0  0  0  0  0  0  0
##   any           0  0  0  0  0  0  0  0
##   applies       0  0  0  0  0  0  0  0
##   appointments  0  0  0  0  0  0  0  0
out = findFreqTerms(tdm,lowfreq=5)
print(out)
##  [1] "academic"    "and"         "derivatives" "from"        "has"        
##  [6] "his"         "many"        "research"    "sanjiv"      "that"       
## [11] "the"         "world"

Term Frequency - Inverse Document Frequency (TF-IDF)

This is a weighting scheme provided to sharpen the importance of rare words in a document, relative to the frequency of these words in the corpus. It is based on simple calculations and even though it does not have strong theoretical foundations, it is still very useful in practice. The TF-IDF is the importance of a word \(w\) in a document \(d\) in a corpus \(C\). It is therefore a function of all three, written as TF-IDF\((w,d,C)\), and is the product of term frequency (TF) and inverse document frequency (IDF).

The frequency of a word in a document is defined as \[ f(w,d) = \frac{\#w \in d}{|d|} \] where \(|d|\) is the number of words in the document. We usually normalize word frequency so that \[ TF(w,d) = \ln[f(w,d)] \] This is log normalization. Another form of normalization is known as double normalization and is as follows: \[ TF(w,d) = \frac{1}{2} + \frac{1}{2} \frac{f(w,d)}{\max_{w \in d} f(w,d)} \] Note that normalization is not necessary, but it tends to help shrink the difference between counts of words.

Inverse document frequency is as follows: \[ IDF(w,C) = \ln\left[ \frac{|C|}{|d_{w \in d}|} \right] \] That is, we take the log of the ratio of the number of documents in the corpus \(C\) to the number of documents in the corpus that contain word \(w\).

Finally, we have the weighting score for a given word \(w\) in document \(d\) in corpus \(C\): \[ \mbox{TF-IDF}(w,d,C) = TF(w,d) \times IDF(w,C) \]

Example of TF-IDF

We illustrate this with an application to the previously computed term-document matrix.

tdm_mat = as.matrix(tdm)  #Convert tdm into a matrix
print(dim(tdm_mat))
## [1] 317  79
nw = dim(tdm_mat)[1]
nd = dim(tdm_mat)[2]
doc = 13   #Choose document
word = "derivatives"   #Choose word

#COMPUTE TF
f = NULL
for (w in row.names(tdm_mat)) {
    f = c(f,tdm_mat[w,doc]/sum(tdm_mat[,doc]))
}
fw = tdm_mat[word,doc]/sum(tdm_mat[,doc])
TF = 0.5 + 0.5*fw/max(f)
print(TF)
## [1] 0.75
#COMPUTE IDF
ndw = length(which(tdm_mat[word,]>0))  #Number of documents containing the word
print(ndw)
## [1] 5
IDF = log(nd/ndw)  #Log of the ratio, as in the IDF formula
print(IDF)
## [1] 2.76001
#COMPUTE TF-IDF
TF_IDF = TF*IDF
print(TF_IDF)  #With normalization
## [1] 2.070007
print(fw*IDF)   #Without normalization
## [1] 0.3450012

We can write this code into a function and work out the TF-IDF for all words. Then these word weights may be used in further text analysis.
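
One way this might look is sketched below. The function name tfidf_all and the toy matrix are assumptions for this illustration. It uses the double-normalization form of TF and the log form of IDF from the formulas above, and it additionally assumes that a word absent from a document should get a weight of zero.

```r
# Sketch: TF-IDF weights for every word in every document of a term-document matrix
tfidf_all = function(tdm_mat) {
  nd = ncol(tdm_mat)                                   # number of documents |C|
  doc_len = pmax(colSums(tdm_mat), 1)                  # words per document (guard empty docs)
  f = sweep(tdm_mat, 2, doc_len, "/")                  # f(w,d): relative frequency
  max_f = pmax(apply(f, 2, max), .Machine$double.eps)  # largest f in each document
  tf = 0.5 + 0.5 * sweep(f, 2, max_f, "/")             # double normalization
  tf[tdm_mat == 0] = 0                                 # assumption: absent words get zero
  idf = log(nd / rowSums(tdm_mat > 0))                 # ln(|C| / documents containing w)
  tf * idf                                             # row i is scaled by idf[i]
}

# Toy matrix: 3 terms (rows) by 3 documents (columns)
toy = matrix(c(2, 0, 1,
               1, 1, 0,
               0, 3, 0), nrow = 3, byrow = TRUE,
             dimnames = list(c("alpha", "beta", "gamma"), c("1", "2", "3")))
w = tfidf_all(toy)
print(round(w, 4))
```

Applied to the matrix tdm_mat from the example, the same function would return a matrix of the same dimensions holding one weight per word and document.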

TF-IDF in the tm package

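We may also directly use the weightTfIdf function in the tm package. With its default settings, the output below is consistent with the following computation, in which the term frequency is the count of the word divided by the number of terms in the document, and the inverse document frequency uses a base-2 log: \[ \mbox{TF-IDF}(w,d,C) = \frac{\#w \in d}{|d|} \times \log_2\left[ \frac{|C|}{|d_{w \in d}|} \right] \]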

Example:

library(tm)
textarray = c("Free software comes with ABSOLUTELY NO certain WARRANTY","You are welcome to redistribute free software under certain conditions","Natural language support for software in an English locale","A collaborative project with many contributors")
textcorpus = Corpus(VectorSource(textarray))
m = TermDocumentMatrix(textcorpus)
print(as.matrix(m))
##                Docs
## Terms           1 2 3 4
##   absolutely    1 0 0 0
##   are           0 1 0 0
##   certain       1 1 0 0
##   collaborative 0 0 0 1
##   comes         1 0 0 0
##   conditions    0 1 0 0
##   contributors  0 0 0 1
##   english       0 0 1 0
##   for           0 0 1 0
##   free          1 1 0 0
##   language      0 0 1 0
##   locale        0 0 1 0
##   many          0 0 0 1
##   natural       0 0 1 0
##   project       0 0 0 1
##   redistribute  0 1 0 0
##   software      1 1 1 0
##   support       0 0 1 0
##   under         0 1 0 0
##   warranty      1 0 0 0
##   welcome       0 1 0 0
##   with          1 0 0 1
##   you           0 1 0 0
print(as.matrix(weightTfIdf(m)))
##                Docs
## Terms                    1          2          3   4
##   absolutely    0.28571429 0.00000000 0.00000000 0.0
##   are           0.00000000 0.22222222 0.00000000 0.0
##   certain       0.14285714 0.11111111 0.00000000 0.0
##   collaborative 0.00000000 0.00000000 0.00000000 0.4
##   comes         0.28571429 0.00000000 0.00000000 0.0
##   conditions    0.00000000 0.22222222 0.00000000 0.0
##   contributors  0.00000000 0.00000000 0.00000000 0.4
##   english       0.00000000 0.00000000 0.28571429 0.0
##   for           0.00000000 0.00000000 0.28571429 0.0
##   free          0.14285714 0.11111111 0.00000000 0.0
##   language      0.00000000 0.00000000 0.28571429 0.0
##   locale        0.00000000 0.00000000 0.28571429 0.0
##   many          0.00000000 0.00000000 0.00000000 0.4
##   natural       0.00000000 0.00000000 0.28571429 0.0
##   project       0.00000000 0.00000000 0.00000000 0.4
##   redistribute  0.00000000 0.22222222 0.00000000 0.0
##   software      0.05929107 0.04611528 0.05929107 0.0
##   support       0.00000000 0.00000000 0.28571429 0.0
##   under         0.00000000 0.22222222 0.00000000 0.0
##   warranty      0.28571429 0.00000000 0.00000000 0.0
##   welcome       0.00000000 0.22222222 0.00000000 0.0
##   with          0.14285714 0.00000000 0.00000000 0.2
##   you           0.00000000 0.22222222 0.00000000 0.0
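
As a sanity check, the weights for "software" in the table above can be reproduced by hand. The sketch assumes the weighting is the word count divided by the number of terms in the document, times the base-2 log of the inverse document frequency. This assumption is inferred from the printed output rather than taken from the package source. Documents 1 and 2 contribute 7 and 9 terms to the matrix respectively, and "software" appears in 3 of the 4 documents.

```r
# Reproduce the "software" weights for documents 1 and 2 by hand
N = 4              # documents in the corpus
n_software = 3     # documents containing "software"
len1 = 7           # terms counted in document 1
len2 = 9           # terms counted in document 2
w1 = (1 / len1) * log2(N / n_software)
w2 = (1 / len2) * log2(N / n_software)
print(round(c(w1, w2), 8))
```

The two values agree with the "software" row of the table to the printed precision.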

Cosine Similarity in the Text Domain

In this segment we will learn some popular functions on text that are used in practice. One of the first things we like to do is to find similar text or like sentences (think of web search as one application). Since documents are vectors in the TDM, we may want to find the closest vectors or compute the distance between vectors. A common measure is the cosine of the angle between two document vectors \(A\) and \(B\):

\[ cos(\theta) = \frac{A \cdot B}{||A|| \times ||B||} \]

where \(||A|| = \sqrt{A \cdot A}\) is the norm of \(A\), i.e., the square root of the dot product of \(A\) with itself. This gives the cosine of the angle between the two vectors, which is zero for orthogonal vectors and 1 for vectors that point in the same direction (identical vectors being a special case).

#COSINE DISTANCE OR SIMILARITY
A = as.matrix(c(0,3,4,1,7,0,1))
B = as.matrix(c(0,4,3,0,6,1,1))
cos = t(A) %*% B / (sqrt(t(A)%*%A) * sqrt(t(B)%*%B))
print(cos)
##           [,1]
## [1,] 0.9682728
library(lsa)
## Loading required package: SnowballC
#THE cosine FUNCTION IN lsa TAKES TWO PLAIN VECTORS (OR A SINGLE MATRIX)
A = c(0,3,4,1,7,0,1)
B = c(0,4,3,0,6,1,1)
print(cosine(A,B))
##           [,1]
## [1,] 0.9682728
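
When documents are stored as columns of a term-document matrix, all pairwise similarities can be computed in one step. The following base-R sketch does this for the two vectors above; the function name cosine_matrix is an assumption for this illustration.

```r
# Pairwise cosine similarity between the columns of a matrix (base R only)
cosine_matrix = function(m) {
  norms = sqrt(colSums(m^2))            # norm of each column vector
  crossprod(m) / outer(norms, norms)    # dot products divided by products of norms
}

A = c(0, 3, 4, 1, 7, 0, 1)
B = c(0, 4, 3, 0, 6, 1, 1)
sim = cosine_matrix(cbind(A, B))
print(sim)
```

The off-diagonal entries reproduce the value computed above, and the diagonal is 1 because every vector is perfectly similar to itself.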

Using the ANLP package for bigrams and trigrams

This package has a few additional functions that make the preceding ideas more streamlined to implement. First let’s read in the usual text.

library(ANLP)
## Warning: package 'ANLP' was built under R version 3.2.5
## Loading required package: qdap
## Warning: package 'qdap' was built under R version 3.2.5
## Loading required package: qdapDictionaries
## Loading required package: qdapRegex
## Loading required package: qdapTools
## Loading required package: RColorBrewer
## 
## Attaching package: 'qdap'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, as.TermDocumentMatrix
## The following object is masked from 'package:NLP':
## 
##     ngrams
## The following object is masked from 'package:stringr':
## 
##     %>%
## The following object is masked from 'package:base':
## 
##     Filter
## Loading required package: RWeka
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.2.5
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:qdap':
## 
##     %>%
## The following object is masked from 'package:qdapTools':
## 
##     id
## The following objects are masked from 'package:qdapRegex':
## 
##     escape, explain
## The following object is masked from 'package:lsa':
## 
##     query
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: replacing previous import by 'tm::TermDocumentMatrix' when loading
## 'ANLP'
download.file("http://srdas.github.io/bio-candid.html",destfile = "text")
text = readTextFile("text","UTF-8")
ctext = cleanTextData(text)  #Creates a text corpus

The last function removes non-English characters, numbers, white space, brackets, and punctuation. It also handles abbreviations and contractions, and converts the entire text to lower case.

We now make TDMs for unigrams, bigrams, trigrams. Then, combine them all into one list for word prediction.

g1 = generateTDM(ctext,1)
g2 = generateTDM(ctext,2)
g3 = generateTDM(ctext,3)
gmodel = list(g1,g2,g3)

Next, use the back-off algorithm to predict the next word following a given phrase.

print(predict_Backoff("you never",gmodel))
## [1] "leave"
print(predict_Backoff("life is",gmodel))
## [1] "the"
print(predict_Backoff("been known",gmodel))
## [1] "to"
print(predict_Backoff("needs to",gmodel))
## [1] "his"
print(predict_Backoff("worked at",gmodel))
## [1] "citibank"
print(predict_Backoff("being an",gmodel))
## [1] "unchecked"
print(predict_Backoff("publish",gmodel))
## [1] "in"
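
To make the back-off idea concrete, here is a minimal base R sketch. It is not the ANLP implementation; the helper names count_ngrams and predict_next and the toy corpus are made up for illustration. The function first looks for a trigram that starts with the last two words of the phrase, backs off to a bigram that starts with the last word if nothing is found, and finally falls back to the most frequent single word.

```r
# Toy corpus, made up for illustration
corpus = c("you can never leave", "you can check out",
           "life is good", "life is good now")

# Hypothetical helper: frequency table of all n-grams of order n
count_ngrams = function(sentences, n) {
    grams = unlist(lapply(strsplit(tolower(sentences), " +"), function(w) {
        if (length(w) < n) return(character(0))
        sapply(1:(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
    }))
    table(grams)
}

# Hypothetical helper: predict the next word, backing off to shorter contexts
predict_next = function(phrase, sentences, max_n = 3) {
    words = strsplit(tolower(phrase), " +")[[1]]
    for (n in max_n:2) {
        k = n - 1                                   # words of context used by an n-gram
        if (length(words) < k) next                 # phrase too short: back off
        context = paste0(paste(tail(words, k), collapse = " "), " ")
        counts = count_ngrams(sentences, n)
        if (length(counts) == 0) next               # no n-grams of this order
        hits = counts[startsWith(names(counts), context)]
        if (length(hits) > 0) {
            best = names(hits)[which.max(hits)]     # most frequent continuation
            return(tail(strsplit(best, " ")[[1]], 1))
        }
    }
    unigrams = count_ngrams(sentences, 1)
    names(unigrams)[which.max(unigrams)]            # last resort: most frequent word
}

print(predict_next("you can never", corpus))  # trigram context "can never" is found
print(predict_next("will never", corpus))     # no trigram match, backs off to the bigram
```

The second call shows the back-off step: no trigram begins with "will never", so the prediction comes from the bigrams that begin with "never".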

Wordclouds

Wordclouds are an interesting way to represent text. They give an instant visual summary. The wordcloud package in R may be used to create your own wordclouds.

#MAKE A WORDCLOUD
library(wordcloud)
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)

#REMOVE STOPWORDS AND NUMBERS, THEN REDRAW THE WORDCLOUD
ctext1 = tm_map(ctext,removeWords,stopwords("english"))
ctext1 = tm_map(ctext1, removeNumbers)
tdm = TermDocumentMatrix(ctext1,control=list(wordLengths=c(1,Inf)))  #keep words of any length
tdm2 = as.matrix(tdm)
wordcount = sort(rowSums(tdm2),decreasing=TRUE)
tdm_names = names(wordcount)
wordcloud(tdm_names,wordcount)

Manipulating Text

Stemming

Stemming is the procedure by which a word is reduced to its root or stem. This is done so as to treat words from the same stem as one word, rather than as separate words. For example, we do not want “eaten” and “eating” to be treated as different words.

#STEMMING
ctext2 = tm_map(ctext,removeWords,stopwords("english"))
ctext2 = tm_map(ctext2, stemDocument)
print(lapply(ctext2, as.character))
## $`1`
##  [1] ""                                                         
##  [2] ""                                                         
##  [3] ""                                                         
##  [4] "sanjiv das   william  janic terri professor  financ"      
##  [5] "santa clara univers leavey school  busi  previous held"   
##  [6] "faculti appoint  associ professor  harvard busi school"   
##  [7] " uc berkeley  hold postgradu degre  financ mphil"         
##  [8] "phd  new york univers comput scienc ms  uc"               
##  [9] "berkeley  mba   indian institut  manag ahmedabad"         
## [10] "bcom  account  econom univers  bombay sydenham"           
## [11] "colleg   also  qualifi cost  work account  "              
## [12] "senior editor   journal  invest manag coeditor"           
## [13] " journal  deriv   journal  financi servic"                
## [14] "research  associ editor   academ journal prior"           
## [15] "  academ  work   deriv busi "                             
## [16] "asiapacif region   vicepresid  citibank  current"         
## [17] "research interest includ  model  default risk machin"     
## [18] "learn social network deriv price model portfolio"         
## [19] "theori  ventur capit   publish  nineti articl"            
## [20] "academ journal   won numer award  research"               
## [21] "teach  recent book deriv principl  practic"               
## [22] "publish  may  current also serv   senior fellow"          
## [23] " fdic center  financi research"                           
## [24] ""                                                         
## [25] ""                                                         
## [26] "sanjiv das  short academ life histori"                    
## [27] ""                                                         
## [28] " loaf  work  mani part  asia  never realli"               
## [29] "grow  sanjiv move  new york  chang  world hope"           
## [30] " research  graduat    phd  nyu"                           
## [31] "sinc  spent five year  boston  now live  san jose"        
## [32] "california sanjiv love anim place   world "               
## [33] "mountain meet  sea ride sport motorbik read gadget"       
## [34] "scienc fiction movi  write cool softwar code  "           
## [35] "time avail   excit  daili life sanjiv write"              
## [36] "academ paper  help  relax alway  contrarian sanjiv"       
## [37] "think  new york citi    calm place   world"               
## [38] " california  cours"                                       
## [39] ""                                                         
## [40] ""                                                         
## [41] ""                                                         
## [42] "sanjiv  now  professor  financ  santa clara univers  came"
## [43] " scu  harvard busi school  spent  year  uc berkeley"      
## [44] " past life   unreal world sanjiv work  citibank na"       
## [45] " asiapacif region  take great pleasur  merg  mani"        
## [46] "previous live   current exist   incred confus"            
## [47] " divers"                                                  
## [48] ""                                                         
## [49] ""                                                         
## [50] ""                                                         
## [51] "sanjiv research style  instil   distinct new york state"  
## [52] "mind   chaotic divers  minim method   mad"                
## [53] " publish articl  deriv termstructur model mutual"         
## [54] "fund  internet portfolio choic bank model credit risk"    
## [55] " unpublish articl  mani  area  year ago  took"            
## [56] "time   get anoth degre  comput scienc  berkeley"          
## [57] "confirm   uncheck hobbi can quick becom  obsess"          
## [58] "  learnt   fascin field  random algorithm"                
## [59] "skill  now appli earnest   editori work "                 
## [60] "pursuit mani   stem     epicent  silicon"                 
## [61] "valley"                                                   
## [62] ""                                                         
## [63] ""                                                         
## [64] ""                                                         
## [65] "coastal live   lot  mold sanjiv  need  live near"         
## [66] "ocean  mani walk  greenwich villag convinc   "            
## [67] "  thing   repres investor yet ad mani uniqu"              
## [68] "featur   person util function  learnt  "                  
## [69] "import  open  academ door   ivori tower  let  world"      
## [70] " academia   real challeng given     reconcil mani"        
## [71] " opinion  idea    known   turn  mani"                     
## [72] "offer  mad magazin  publish  academ work   often"         
## [73] "explain  never realli finish  educ  can check"            
## [74] " time  like   can never leav      doom"                   
## [75] "  lifetim  hotel california   believ    "                 
## [76] "bad   get life  realli pretti good"                       
## [77] ""                                                         
## [78] ""                                                         
## [79] ""
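
To see what stemming does to individual words, here is a deliberately crude suffix-stripping sketch in base R. It is not the Porter stemmer that stemDocument relies on; the function name crude_stem and the short suffix list are assumptions made only for illustration, and a real stemmer applies many more rules.

```r
# Hypothetical helper: strip a few common English suffixes,
# but only when at least three characters of stem remain
crude_stem = function(words) {
    words = tolower(words)
    sub("^(.{3,}?)(ing|ed|en|es|s)$", "\\1", words, perl = TRUE)
}

print(crude_stem(c("eating", "eaten", "eats", "walked", "classes", "bus")))
```

Here "eating", "eaten", and "eats" all collapse to "eat", while a short word such as "bus" is left alone because stripping the final "s" would leave fewer than three characters.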

Regular Expressions

Regular expressions are a syntax for matching and modifying strings efficiently. They are complicated but extremely effective. Here we will illustrate with a few examples, but you are encouraged to explore more on your own, as the variations are endless. What you need to do will depend on the application at hand, and with some experience you will become better at using regular expressions. Initial use will, however, be somewhat confusing.

We start with a simple example of a text array where we wish to replace the string “data” with a blank, i.e., we eliminate this string from the text we have.

library(tm)
#Create a text array
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")
print(text)
## [1] "Doc1 is datavision" "Doc2 is datatable"  "Doc3 is data"      
## [4] "Doc4 is nodata"     "Doc5 is simpler"
#Remove all strings with the chosen text for all docs
print(gsub("data","",text))
## [1] "Doc1 is vision"  "Doc2 is table"   "Doc3 is "        "Doc4 is no"     
## [5] "Doc5 is simpler"
#Remove "data" and everything after it in each string (as the output shows, the leading * has no effect here)
print(gsub("*data.*","",text))
## [1] "Doc1 is "        "Doc2 is "        "Doc3 is "        "Doc4 is no"     
## [5] "Doc5 is simpler"
#Remove the character just before "data", then "dat" and any run of a's that follows (this is not an end-of-word match)
print(gsub("*.data*","",text))
## [1] "Doc1 isvision"   "Doc2 istable"    "Doc3 is"         "Doc4 is n"      
## [5] "Doc5 is simpler"
#Remove the character just before "data", the string "data" itself, and everything after it
print(gsub("*.data.*","",text))
## [1] "Doc1 is"         "Doc2 is"         "Doc3 is"         "Doc4 is n"      
## [5] "Doc5 is simpler"
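
If the goal is really to drop whole words that start with, end with, or contain "data", word boundaries are a more direct tool. The following sketch uses the \\b (word boundary) and \\w (word character) classes with perl=TRUE; the variable names starts, ends, and contains are just for illustration.

```r
text = c("Doc1 is datavision","Doc2 is datatable","Doc3 is data","Doc4 is nodata","Doc5 is simpler")

# Remove whole words that START with "data"
starts = gsub("\\bdata\\w*", "", text, perl = TRUE)
print(starts)

# Remove whole words that END with "data"
ends = gsub("\\w*data\\b", "", text, perl = TRUE)
print(ends)

# Remove whole words that CONTAIN "data" anywhere
contains = gsub("\\w*data\\w*", "", text, perl = TRUE)
print(contains)
```

Note that "nodata" survives the first pattern and "datavision" survives the second, which is the behavior the comments describe.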

Complex Regular Expressions using grep

We now explore some more complex regular expressions. A common case is searching for special types of strings such as telephone numbers. Suppose we have a text array that may contain telephone numbers in different formats. We can use a single grep command to find the elements that contain these numbers. Here is some code to illustrate this.

#Create an array with some strings which may also contain telephone numbers as strings. 
x = c("234-5678","234 5678","2345678","1234567890","0123456789","abc 234-5678","234 5678 def","xx 2345678","abc1234567890def")

#Now use grep to find which elements of the array contain telephone numbers
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]",x)
print(idx)
## [1] 1 2 4 6 7 9
print(x[idx])
## [1] "234-5678"         "234 5678"         "1234567890"      
## [4] "abc 234-5678"     "234 5678 def"     "abc1234567890def"
#We can shorten this as follows
idx = grep("[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}",x)
print(idx)
## [1] 1 2 4 6 7 9
print(x[idx])
## [1] "234-5678"         "234 5678"         "1234567890"      
## [4] "abc 234-5678"     "234 5678 def"     "abc1234567890def"
#What if we want to extract only the phone number and drop the rest of the text?
pattern = "[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}"
print(regmatches(x, gregexpr(pattern,x)))
## [[1]]
## [1] "234-5678"
## 
## [[2]]
## [1] "234 5678"
## 
## [[3]]
## character(0)
## 
## [[4]]
## [1] "1234567890"
## 
## [[5]]
## character(0)
## 
## [[6]]
## [1] "234-5678"
## 
## [[7]]
## [1] "234 5678"
## 
## [[8]]
## character(0)
## 
## [[9]]
## [1] "1234567890"
#Or use the stringr package, which is a lot better
library(stringr)
str_extract(x,pattern)
## [1] "234-5678"   "234 5678"   NA           "1234567890" NA          
## [6] "234-5678"   "234 5678"   NA           "1234567890"
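
Once the numbers are extracted, it is often useful to normalize them to digits only, so that "234-5678" and "234 5678" compare as equal. The sketch below does this in base R; it also shows a shorter pattern (short_pattern, a name introduced here) in which the two seven-digit alternatives are merged with a character class.

```r
x = c("234-5678","234 5678","1234567890","abc 234-5678","xx 2345678")
pattern = "[[:digit:]]{3}-[[:digit:]]{4}|[[:digit:]]{3} [[:digit:]]{4}|[1-9][0-9]{9}"
short_pattern = "[[:digit:]]{3}[- ][[:digit:]]{4}|[1-9][0-9]{9}"  # same matches, fewer alternatives

# regexpr finds the first match per element; regmatches drops elements with no match
found = regmatches(x, regexpr(pattern, x))
found_short = regmatches(x, regexpr(short_pattern, x))
print(identical(found, found_short))

# Strip everything that is not a digit
digits = gsub("[^0-9]", "", found)
print(digits)
```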

Using grep for emails

Now we use grep to find the strings that contain email addresses by looking for the “@” sign in the text string. We would proceed as in the following example.

x = c("sanjiv das","srdas@scu.edu","SCU","data@science.edu")
print(grep("\\@",x))
## [1] 2 4
print(x[grep("\\@",x)])
## [1] "srdas@scu.edu"    "data@science.edu"

You get the idea. Using the functions gsub, grep, regmatches, and gregexpr, you can manage most fancy string handling that is needed.
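
As one more illustration, the same functions can pull the email addresses out of longer strings rather than just flagging which strings contain an “@”. The pattern below is a simplified assumption about what an address looks like, not a full validator, and the sample strings (including foo@bar.org) are made up.

```r
x = c("write to srdas@scu.edu today", "SCU", "cc: data@science.edu; foo@bar.org")

# Simplified address pattern: name characters, "@", domain, a dot, and a final alphabetic part
pattern = "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# gregexpr finds every match in each string; regmatches returns them as a list
emails = regmatches(x, gregexpr(pattern, x))
print(unlist(emails))
```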

Web Extraction

Using the rvest package: Overview

The rvest package, written by Hadley Wickham, is a powerful tool for extracting text from web pages. The package provides wrappers around the ‘xml2’ and ‘httr’ packages to make it easy to download, and then manipulate, HTML and XML. The package is best illustrated with some simple examples.

Program to read a web page using the selector gadget

The selector gadget is a useful tool to use in conjunction with the rvest package. It helps you find the html tag (CSS selector) in a web page that you need to pass to the program in order to parse the page element you are interested in. Download from: http://selectorgadget.com/

Here is some code to read in the Slashdot web page and gather the stories currently in its headlines.

library(rvest)
url = "https://slashdot.org/"

doc.html = read_html(url)
text = doc.html %>% html_nodes(".story") %>% html_text()

text = gsub("[\t\n]","",text)
#text = paste(text, collapse=" ")
print(text[1:20])
##  [1] " Ask Slashdot: What's The Best Geeky Gift For Children?1"                                                
##  [2] " Why Apple Just Invested in Wind Turbines In China  (cnn.com) 45"                                        
##  [3] " Struggling Workers Found Sleeping In Tents Behind Amazon's Warehouse  (thecourier.co.uk) 147"           
##  [4] " Analysts Tout 'State of The Developer' Survey By Awarding RPG Characters  (amazon.com) 20"              
##  [5] " Inside the NYPD's Attempt To Build Community Trust Through Twitter  (backchannel.com) 35"               
##  [6] " Fedora-based Linux Distro Korora (Version 25) Now Available For Download  (betanews.com) 21"            
##  [7] " FBI Relents, Confirms Previously-Denied UFO Investigation  (muckrock.com) 58"                           
##  [8] " A 'Turkish Hacker' Is Giving Out Prizes For DDoS Attacks  (csoonline.com) 29"                           
##  [9] " The DEA Has Been Secretly Paying Transport Employees To Search Travelers' Bags  (economist.com) 118"    
## [10] " 5-Year-Old Critical Linux Vulnerability Patched  (threatpost.com) 53"                                   
## [11] " Uber Asks Everyone To Stop Making It The New Tinder  (sfgate.com) 122"                                  
## [12] " New Bug In Windows 10 Anniversary Update Brings Wi-Fi Disconnects  (infoworld.com) 133"                 
## [13] " US Think Tank Wants To Regulate The Design of IoT Devices For Security Purposes  (theregister.co.uk) 78"
## [14] " Autonomous Shuttle Brakes For Squirrels, Skateboarders, and Texting Students  (ieee.org) 68"            
## [15] " 'Star In a Jar' Fusion Reactor Works, Promises Infinite Energy  (space.com) 334"                        
## [16] NA                                                                                                        
## [17] NA                                                                                                        
## [18] NA                                                                                                        
## [19] NA                                                                                                        
## [20] NA
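
The scraped strings bundle three things together: the headline, the source site in parentheses, and the comment count. A possible next step, sketched below in base R on two of the strings from the output above, is to split them into separate fields with regexec. The pattern and the column names are assumptions for illustration; entries without a source in parentheses would not match this pattern.

```r
headlines = c(" Why Apple Just Invested in Wind Turbines In China  (cnn.com) 45",
              " 5-Year-Old Critical Linux Vulnerability Patched  (threatpost.com) 53")

# Three capture groups: title, source inside parentheses, trailing comment count
pattern = "^[[:space:]]*([^[:space:]].*[^[:space:]])[[:space:]]*\\(([^)]+)\\)[[:space:]]*([0-9]+)$"
parts = regmatches(headlines, regexec(pattern, headlines))

# Element 1 of each match is the full string; elements 2 to 4 are the groups
stories = data.frame(title = sapply(parts, `[`, 2),
                     source = sapply(parts, `[`, 3),
                     comments = as.integer(sapply(parts, `[`, 4)),
                     stringsAsFactors = FALSE)
print(stories)
```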

Program to read a web table using the selector gadget

Sometimes we need to read a table embedded in a web page. This is also a simple exercise with rvest.

library(rvest)
url = "http://finance.yahoo.com/q?uhb=uhb2&fr=uh3_finance_vert_gs&type=2button&s=IBM"

doc.html = read_html(url)
table = doc.html %>% html_nodes("table") %>% html_table()

print(table)
## [[1]]
##   X1     X2
## 1 NA Search
## 
## [[2]]
##               X1              X2
## 1 Previous Close          165.36
## 2           Open          166.00
## 3            Bid    166.42 x 300
## 4            Ask    166.59 x 100
## 5    Day's Range 164.60 - 166.72
## 6  52 Week Range 116.90 - 166.72
## 7         Volume       3,146,930
## 8    Avg. Volume       3,585,104
## 
## [[3]]
##                 X1           X2
## 1       Market Cap      158.34B
## 2             Beta         0.91
## 3   PE Ratio (TTM)        13.57
## 4        EPS (TTM)          N/A
## 5    Earnings Date          N/A
## 6 Dividend & Yield 5.60 (3.49%)
## 7 Ex-Dividend Date          N/A
## 8    1y Target Est          N/A

Note that this code extracted all the web tables in the Yahoo! Finance page and returned each one as a list item.
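
Each list item is a two-column data frame of labels and values, all stored as text. A small sketch of how one might turn such a table into a lookup by label follows; the data frame tbl is typed in by hand from a few rows of the output above (it is not fetched from the web), and the name quote_info is made up.

```r
# A few rows copied by hand from the second table above
tbl = data.frame(X1 = c("Previous Close", "Open", "Volume"),
                 X2 = c("165.36", "166.00", "3,146,930"),
                 stringsAsFactors = FALSE)

# Named character vector: values indexed by their labels
quote_info = setNames(tbl$X2, tbl$X1)

# Values arrive as text, so strip the thousands separators before converting
volume = as.numeric(gsub(",", "", quote_info[["Volume"]]))
prev_close = as.numeric(quote_info[["Previous Close"]])
print(volume)
print(prev_close)
```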

Program to read a web table into a data frame

Here we take note of some Russian language sites where we want to extract forex quotes and store them in a data frame.

library(rvest)

url1 <- "http://finance.i.ua/market/kiev/?type=1"  #Buy USD
url2 <- "http://finance.i.ua/market/kiev/?type=2"  #Sell USD

doc1.html = read_html(url1)
table1 = doc1.html %>% html_nodes("table") %>% html_table()
result1 = table1[[1]]
print(head(result1))
##      X1      X2       X3                   X4
## 1 Время    Курс    Сумма              Телефон
## 2 01:48 26.8801 114700 $ +38 093 \n  Показать
## 3 06:22    26.9    100 $ +38 093 \n  Показать
## 4 13:28    26.9   5000 $ +38 093 \n  Показать
## 5 13:29  26.901  37000 $ +38 093 \n  Показать
## 6 13:29   28.65  10000 € +38 093 \n  Показать
##                                       X5
## 1                                  Район
## 2   Ленинградская площадь Обменный пункт
## 3                      м Тараса Шеченка,
## 4 Центр Л. Тостого Д. Спорта Олимпийский
## 5    Обмен Валют Ленинградка Харьковское
## 6                                  подол
##                                                      X6
## 1                                           Комментарий
## 2 От 1000 дол. Крупная гривна. Звоните с 7. 00. Ярослав
## 3                                        Нового образца
## 4      Можно частями, могу подъехать от 500 или за €вро
## 5               От 3т. 500 грн купюры. Звоните. Ярослав
## 6                                         можно частями
doc2.html = read_html(url2)
table2 = doc2.html %>% html_nodes("table") %>% html_table()
result2 = table2[[1]]
print(head(result2))
##      X1      X2            X3                   X4
## 1 Время    Курс         Сумма              Телефон
## 2 01:36   0.426 970000 \u20bd +38 093 \n  Показать
## 3 01:47 26.9799      147000 $ +38 093 \n  Показать
## 4 01:50 28.7699       27000 € +38 093 \n  Показать
## 5 14:48    27.0        3500 $ +38 050 \n  Показать
## 6 13:29   26.97       55000 $ +38 096 \n  Показать
##                                     X5
## 1                                Район
## 2    Ленинградская площадь Обмен Валют
## 3 Ленинградская площадь Обменный пункт
## 4  Ленинградская площадь Обменный пунк
## 5                         Голосеевский
## 6                 Еврогазбанк петровка
##                                                                            X6
## 1                                                                 Комментарий
## 2 От 200т рублей. 5000 купюры. Или за доллар 63. 15. Звоните с 6. 00. Ярослав
## 3          От 100 дол. Без комиссий. Нового образца. Звоните с 7. 00. Ярослав
## 4                         От 3т евро. Разные купюры. Звоните с 7. 00. Ярослав
## 5             можно частями, обмен валют, м. Лыбедская, Антоновича (Горького)
## 6                                                               можно частями
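
Notice that the scraped table carries its header in the first row and stores every column as text. Before any analysis we would usually promote the header and convert the numeric columns. The sketch below does this on a small hand-typed copy of the first three columns and rows shown above, with the header translated to English; the data frame raw and the new column names are assumptions for illustration.

```r
# Hand-typed copy of a few cells from result1 (header row translated to English)
raw = data.frame(X1 = c("Time", "01:48", "06:22", "13:28"),
                 X2 = c("Rate", "26.8801", "26.9", "26.9"),
                 X3 = c("Amount", "114700 $", "100 $", "5000 $"),
                 stringsAsFactors = FALSE)

quotes = raw[-1, ]                               # drop the header row
names(quotes) = c("time", "rate", "amount")      # assign readable column names
rownames(quotes) = NULL

quotes$rate = as.numeric(quotes$rate)            # text to number
quotes$currency = sub("^[0-9]+ *", "", quotes$amount)      # symbol after the number
quotes$amount = as.numeric(sub(" .*$", "", quotes$amount)) # number before the symbol
print(quotes)
print(mean(quotes$rate))
```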

Using the RSelenium package

#Clicking the Show More button on a Google Scholar page

library(RCurl)
library(RSelenium)
library(rvest)
library(stringr)
library(igraph)
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost" 
                      , port = 4444
                      , browserName = "firefox"
)
remDr$open()
remDr$getStatus()

Application to Google Scholar data

remDr$navigate("http://scholar.google.com")
webElem <- remDr$findElement(using = 'css selector', "input#gs_hp_tsi")
webElem$sendKeysToElement(list("Sanjiv Das", "\uE007"))
link <- webElem$getCurrentUrl()
page <- read_html(as.character(link))
citations <- page %>% html_nodes (".gs_rt2")
matched <- str_match_all(citations, "<a href=\"(.*?)\"")
scholarurl <- paste("https://scholar.google.com", matched[[1]][,2], sep="")
page <- read_html(as.character(scholarurl))
remDr$navigate(as.character(scholarurl))
authorlist <- page %>% html_nodes(css=".gs_gray") %>% html_text() # Selecting fields after CSS selector .gs_gray
authorlist <- as.data.frame(authorlist)
odd_index <- seq(1,nrow(authorlist),2) #Split rows by odd/even index to form a table.
even_index <- seq (2,nrow(authorlist),2)
authornames <- data.frame(x=authorlist[odd_index,1])
papernames <- data.frame(x=authorlist[even_index,1])
pubmatrix <- cbind(authornames,papernames) #Pair each odd-row entry with the even-row entry that follows it

# Building the view all link on scholar page.
a=str_split(matched, "user=")
x <- substring(a[[1]][2], 1,12)
y<- paste("https://scholar.google.com/citations?view_op=list_colleagues&hl=en&user=", x, sep="")
remDr$navigate(y)

#Reading view all page to get author list:
page <- read_html(as.character(y))
z <- page %>% html_nodes (".gsc_1usr_name")

x <-lapply(z,str_extract,">[A-Z]+[a-z]+ .+<")
x<-lapply(x,str_replace, ">","")
x<-lapply(x,str_replace, "<","")

# Graph function:
bsk <- as.matrix(cbind("SR Das", unlist(x)))
bsk.network<-graph.data.frame(bsk, directed=F)
plot(bsk.network)

Extracting Text from the Web using APIs

We now turn to getting text from the web using the APIs of services such as Twitter, Facebook, etc. You will need to open a free developer account on each site, and you will also need the R package specific to each source.

Twitter

First create a Twitter developer account to get the required credentials for accessing the API. See: https://dev.twitter.com/

The Twitter API needs a lot of handshaking…

##TWITTER EXTRACTOR
library(twitteR)
library(ROAuth)
library(RCurl)
download.file(url="https://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
#certificate file based on Privacy Enhanced Mail (PEM) protocol: https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail

cKey = "rIXqaZNxJ4A8YB6jhJsXEh9HX"  #These are my keys and won't work for you
cSecret = "KC5kRgsJVrBV6vNIndrF69tcHFfzwcqpQOzLMO80Cu3dVFpcZb"   #use your own secret
reqURL = "https://api.twitter.com/oauth/request_token"
accURL = "https://api.twitter.com/oauth/access_token"
authURL = "https://api.twitter.com/oauth/authorize"

#NOW SUBMIT YOUR CODES AND ASK FOR CREDENTIALS
cred = OAuthFactory$new(consumerKey=cKey, consumerSecret=cSecret,requestURL=reqURL, accessURL=accURL,authURL=authURL)
cred$handshake(cainfo="cacert.pem") #Asks for token

#Test and save credentials
#registerTwitterOAuth(cred)
#save(list="cred",file="twitteR_credentials")
#FIRST PHASE DONE

Accessing Twitter

##USE httr, SECOND PHASE
library(httr)
#options(httr_oauth_cache=T)
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"
setup_twitter_oauth(cKey,cSecret,accToken,accTokenSecret)  #At prompt type 1

The following, more direct code chunk handles the handshaking better and faster than the preceding approach.

library(stringr)
library(twitteR)
library(ROAuth)
library(RCurl)
cKey = "rIXqaZNxJ4A8YB6jhJsXEh9HX"  
cSecret = "KC5kRgsJVrBV6vNIndrF69tcHFfzwcqpQOzLMO80Cu3dVFpcZb"   
accToken = "18666236-DmDE1wwbpvPbDcw9kwt9yThGeyYhjfpVVywrHuhOQ"
accTokenSecret = "cttbpxpTtqJn7wrCP36I59omNI5GQHXXgV41sKwUgc"

setup_twitter_oauth(consumer_key = cKey, 
                    consumer_secret = cSecret, 
                    access_token = accToken,
                    access_secret = accTokenSecret)
## [1] "Using direct authentication"

This completes the handshaking with Twitter. Now we can access tweets using the functions in the twitteR package.
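
Since keys and secrets should not normally be shared, one common alternative to typing them into the script is to read them from environment variables. The sketch below is one way to do that in base R; the helper get_credential and the variable name TWITTER_CONSUMER_KEY are hypothetical, and the dummy value is set inside the session only so that the example runs.

```r
# Hypothetical helper: read a credential from the environment, failing loudly if absent
get_credential = function(name) {
    value = Sys.getenv(name, unset = NA)
    if (is.na(value) || value == "") {
        stop("Environment variable ", name, " is not set")
    }
    value
}

# For illustration only: normally this would be set outside R, e.g., in .Renviron
Sys.setenv(TWITTER_CONSUMER_KEY = "dummy-key")

cKey = get_credential("TWITTER_CONSUMER_KEY")
print(nchar(cKey))   # confirm something was read without printing the key itself
```

The same helper can be called once for each of the four values passed to setup_twitter_oauth.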

Using the twitteR package

#EXAMPLE 1
s = searchTwitter("#GOOG")  #This is a list
s

#CONVERT TWITTER LIST TO TEXT ARRAY (see documentation in twitteR package)
twts = twListToDF(s)  #This gives a dataframe with the tweets
names(twts)

twts_array = twts$text
print(twts$retweetCount)
twts_array

#EXAMPLE 2
s = getUser("srdas")
fr = s$getFriends()
print(length(fr))
print(fr[1:10])
s_tweets = userTimeline("srdas",n=20)
print(s_tweets)

getCurRateLimitInfo(c("srdas"))

Getting Streaming Data from Twitter

This assumes you have a working Twitter account and have already connected R to it using the twitteR package.

library(streamR)
filterStream(file.name = "tweets.json", # Save tweets in a json file
             track = "useR_Stanford" , # Collect tweets mentioning useR_Stanford. Can use twitter handles or keywords.
             language = "en",
             timeout = 60, # Keep connection alive for 60 seconds
             oauth = cred) # Use OAuth credentials

tweets.df <- parseTweets("tweets.json", simplify = FALSE) # parse the json file and save to a data frame called tweets.df. Simplify = FALSE ensures that we include lat/lon information in that data frame.

Retrieving tweets of a particular user over a 60 second time period

filterStream(file.name = "tweets.json", # Save tweets in a json file
             follow = "3497513953" , # Collect tweets from the useR2016 feed. Must use the Twitter ID of the user.
             language = "en",
             timeout = 60, # Keep connection alive for 60 seconds
             oauth = cred) # Use the OAuth credentials
tweets.df <- parseTweets("tweets.json", simplify = FALSE)

Streaming messages from the accounts your user follows.

userStream( file.name="my_timeline.json", with="followings",tweets=10, oauth=cred )

Facebook

Now we move on to using Facebook, which is a little less trouble than Twitter. Also the results may be used for creating interesting networks.

##FACEBOOK EXTRACTOR
library(Rfacebook)
library(SnowballC)
library(Rook)
library(ROAuth)
app_id = "847737771920076"   # USE YOUR OWN IDs
app_secret = "a120a2ec908d9e00fcd3c619cad7d043"
fb_oauth = fbOAuth(app_id,app_secret,extended_permissions=TRUE)
#save(fb_oauth,file="fb_oauth")

#DIRECT LOAD
load("fb_oauth")

Examples

##EXAMPLES
bbn = getUsers("bloombergnews",token=fb_oauth)
print(bbn)

page = getPage(page="bloombergnews",token=fb_oauth,n=20)
print(dim(page))

print(head(page))

print(names(page))

print(page$message)

print(page$message[11])

Yelp - Setting up an authorization

First we examine the protocol for connecting to the Yelp API. This assumes you have opened a developer account with Yelp.

###CODE to connect to YELP.
consumerKey = "z6w-Or6HSyKbdUTmV9lbOA"
consumerSecret = "ImUufP3yU9FmNWWx54NUbNEBcj8"
token = "mBzEBjhYIGgJZnmtTHLVdQ-0cyfFVRGu"
token_secret = "v0FGCL0TS_dFDWFwH3HptDZhiLE"

Yelp - handshaking with the API

require(httr)
require(httpuv)
require(jsonlite)
# authorization
myapp = oauth_app("YELP", key=consumerKey, secret=consumerSecret)
sig=sign_oauth1.0(myapp, token=token,token_secret=token_secret)
## Searching the top ten bars in Chicago and SF.
limit <- 10

# 10 bars in Chicago
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&location=Chicago%20IL&term=bar")
# or 10 bars by geo-coordinates
yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=",limit,"&ll=37.788022,-122.399797&term=bar")

locationdata=GET(yelpurl, sig)
locationdataContent = content(locationdata)
locationdataList=jsonlite::fromJSON(toJSON(locationdataContent))
head(data.frame(locationdataList))

for (j in 1:limit) {
  print(locationdataContent$businesses[[j]]$snippet_text)
}
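
The search URLs above are pasted together by hand, with the space in "Chicago IL" already written as %20. A small helper can take care of the encoding instead. This is only a sketch: the function name build_query_url is made up, and it simply reproduces the first search URL from the code above using URLencode from the utils package.

```r
# Hypothetical helper: build a query URL from a named list of parameters
build_query_url = function(base, params) {
    values = sapply(params, function(v) utils::URLencode(as.character(v), reserved = TRUE))
    pairs = paste0(names(params), "=", values)
    paste0(base, "?", paste(pairs, collapse = "&"))
}

yelpurl = build_query_url("http://api.yelp.com/v2/search/",
                          list(limit = 10, location = "Chicago IL", term = "bar"))
print(yelpurl)
```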

Dictionaries

  1. Webster’s defines a “dictionary” as “…a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses.”

  2. The Harvard General Inquirer: http://www.wjh.harvard.edu/~inquirer/

  3. Standard Dictionaries: www.dictionary.com, and www.merriam-webster.com.

  4. Computer dictionary: http://www.hyperdictionary.com/computer that contains about 14,000 computer related words, such as “byte” or “hyperlink”.

  5. Math dictionary, such as http://www.amathsdictionaryforkids.com/dictionary.html.

  6. Medical dictionary, see http://www.hyperdictionary.com/medical.

Dictionaries - II

  1. Internet lingo dictionaries may be used to complement standard dictionaries with words that are not usually found in standard language, for example, see http://www.netlingo.com/dictionary/all.php for words such as “2BZ4UQT” which stands for “too busy for you cutey” (LOL). When extracting text messages, postings on Facebook, or stock message board discussions, internet lingo does need to be parsed and such a dictionary is very useful.

  2. Associative dictionaries are also useful when trying to find context, as the word may be related to a concept, identified using a dictionary such as http://www.visuwords.com/. This dictionary doubles up as a thesaurus, as it provides alternative words and phrases that mean the same thing, and also related concepts.

  3. Value dictionaries deal with values and may be useful when affect alone (positive or negative) is insufficient for scoring text. The Lasswell Value Dictionary http://www.wjh.harvard.edu/~inquirer/lasswell.htm may be used to score the loading of text on the eight basic value categories: Wealth, Power, Respect, Rectitude, Skill, Enlightenment, Affection, and Well being.

Lexicons

  1. A lexicon is defined by Webster’s as “a book containing an alphabetical arrangement of the words in a language and their definitions; the vocabulary of a language, an individual speaker or group of speakers, or a subject; the total stock of morphemes in a language.” This suggests it is not that different from a dictionary.

  2. A “morpheme” is defined as “a word or a part of a word that has a meaning and that contains no smaller part that has a meaning.”

  3. In the text analytics realm, we will take a lexicon to be a smaller, special purpose dictionary, containing words that are relevant to the domain of interest.

  4. The benefit of a lexicon is that it enables focusing only on words that are relevant to the analytics and discards words that are not.

  5. Another benefit is that since it is a smaller dictionary, the computational effort required by text analytics algorithms is drastically reduced.

Constructing a lexicon

  1. By hand. This is an effective technique and the simplest. It calls for a human reader who scans a representative sample of text documents and culls important words that lend interpretive meaning.

  2. Examine the term document matrix for the most frequent words, and pick the ones that carry strong connotations for the classification task at hand.

  3. Use pre-classified documents in a text corpus. We analyze the separate groups of documents to find words whose difference in frequency between groups is highest. Such words are likely to be better at discriminating between groups.
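
The third approach can be sketched in a few lines of base R. The two tiny groups of documents below are made up, and the helper name word_freq is an assumption; the point is only to show relative word frequencies being computed per group and then differenced, so that words at the two ends of the sorted list become candidates for the lexicon.

```r
# Made-up pre-classified documents
pos_docs = c("great earnings strong growth", "strong buy great quarter")
neg_docs = c("weak earnings big loss", "loss widens weak quarter")

# Hypothetical helper: relative frequency of each word within a group
word_freq = function(docs) {
    words = unlist(strsplit(tolower(docs), "[^a-z]+"))
    words = words[words != ""]
    f = table(words) / length(words)
    setNames(as.numeric(f), names(f))   # plain named vector
}

f_pos = word_freq(pos_docs)
f_neg = word_freq(neg_docs)

# Align both groups on a common vocabulary; words absent from a group get frequency 0
vocab = union(names(f_pos), names(f_neg))
p = f_pos[vocab]; p[is.na(p)] = 0
n = f_neg[vocab]; n[is.na(n)] = 0
freq_diff = setNames(p - n, vocab)

print(sort(freq_diff, decreasing = TRUE))
```

Words such as "earnings" and "quarter" appear equally often in both groups, so their difference is zero and they would not be chosen.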

Lexicons as Word Lists

  1. Das and Chen (2007) constructed a lexicon of about 375 words that are useful in parsing sentiment from stock message boards. This lexicon also introduced the notion of “negation tagging” into the literature.

  2. Loughran and McDonald (2011):

Scoring Text

Mood Scoring using Harvard Inquirer

Creating Positive and Negative Word Lists

#MOOD SCORING USING HARVARD INQUIRER
#Read in the Harvard Inquirer Dictionary
#And create a list of positive and negative words
HIDict = readLines("data_files/inqdict.txt")
dict_pos = HIDict[grep("Pos",HIDict)]
poswords = NULL
for (s in dict_pos) {
    s = strsplit(s,"#")[[1]][1]
    poswords = c(poswords,strsplit(s," ")[[1]][1])
}
dict_neg = HIDict[grep("Neg",HIDict)]
negwords = NULL
for (s in dict_neg) {
    s = strsplit(s,"#")[[1]][1]
    negwords = c(negwords,strsplit(s," ")[[1]][1])
}
poswords = tolower(poswords)
negwords = tolower(negwords)
print(sample(poswords,25))
##  [1] "athletic"     "matchless"    "woo"          "clear"       
##  [5] "defend"       "unbroken"     "truth"        "abundance"   
##  [9] "tactics"      "joke"         "safe"         "generate"    
## [13] "considerate"  "plain"        "self-respect" "staunchness" 
## [17] "allow"        "glee"         "astound"      "sparkle"     
## [21] "standardize"  "sympathetic"  "brainy"       "fair"        
## [25] "advance"
print(sample(negwords,25))
##  [1] "frighten"      "regression"    "recklessness"  "expose"       
##  [5] "berserk"       "competitor"    "corrode"       "unimpeachable"
##  [9] "paralysis"     "malicious"     "thorny"        "battle"       
## [13] "peculiar"      "defect"        "depress"       "unforgettable"
## [17] "shock"         "pound"         "disapproval"   "mind"         
## [21] "sketchy"       "beastly"       "unseen"        "wrought"      
## [25] "temptation"
poswords = unique(poswords)
negwords = unique(negwords)
print(length(poswords))
## [1] 1647
print(length(negwords))
## [1] 2121

The preceding code created two character vectors, one of positive words and another of negative words, each lower-cased and with duplicates removed.
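The two loops above can also be written without explicit iteration. The sketch below is an assumption-laden illustration: HIDict_demo is a handful of made-up lines that imitate the layout the loops rely on (the word first, optionally followed by a "#n" sense suffix, then space-separated tags), and get_words is a hypothetical helper, not part of the original code.

```r
# Hypothetical lines imitating the layout assumed by the loops above
HIDict_demo = c("ABLE H4Lvd Pos Pstv Virtue",
                "ABOUND H4Lvd Pos Pstv",
                "ABSURD H4Lvd Neg Ngtv Vice",
                "ACCEPT#1 H4Lvd Pos Pstv Submit",
                "ACCEPT#2 H4Lvd Pos Pstv")

# Keep the text before the first space or "#", lower-case it, de-duplicate
get_words = function(lines, tag) {
    hits = lines[grep(tag, lines)]
    first = sub("[ #].*$", "", hits)
    unique(tolower(first))
}

poswords_demo = get_words(HIDict_demo, "Pos")
negwords_demo = get_words(HIDict_demo, "Neg")
print(poswords_demo)
print(negwords_demo)
```

As in the original loops, grep() matches the tag anywhere on the line, so a stricter pattern may be needed if the tag can also occur inside another field.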

You can also use EmoLex, the NRC Word-Emotion Lexicon, which already tags words as positive or negative; see: http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

One Function to Rule All Text

To score text, we first need to clean it and put it into an array of words that can be compared with the positive and negative word lists. I wrote a general-purpose function that reads text from a web page and cleans it up for further use.

library(tm)
library(stringr)
#READ IN TEXT FOR ANALYSIS, PUT IT IN A CORPUS, OR ARRAY, OR FLAT STRING
#cstem=1, if stemming needed
#cstop=1, if stopwords to be removed
#ccase=1 for lower case, ccase=2 for upper case
#cpunc=1, if punctuation to be removed
#cflat=1 for flat text wanted, cflat=2 if text array, else returns corpus
read_web_page = function(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=0) {
    #READ THE PAGE SOURCE, ONE ELEMENT PER LINE
    text = readLines(url)
    #CRUDE HTML STRIPPING: DROP EVERY LINE CONTAINING ANY OF < > ] } _ /
    text = text[setdiff(seq(1,length(text)),grep("<",text))]
    text = text[setdiff(seq(1,length(text)),grep(">",text))]
    text = text[setdiff(seq(1,length(text)),grep("]",text))]
    text = text[setdiff(seq(1,length(text)),grep("}",text))]
    text = text[setdiff(seq(1,length(text)),grep("_",text))]
    text = text[setdiff(seq(1,length(text)),grep("\\/",text))]
    ctext = Corpus(VectorSource(text))
    if (cstem==1) { ctext = tm_map(ctext, stemDocument) }
    if (cstop==1) { ctext = tm_map(ctext, removeWords, stopwords("english"))}
    if (cpunc==1) { ctext = tm_map(ctext, removePunctuation) }
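    #WRAP BASE FUNCTIONS IN content_transformer SO tm_map KEEPS A VALID CORPUS (tm >= 0.6)
    if (ccase==1) { ctext = tm_map(ctext, content_transformer(tolower)) }
    if (ccase==2) { ctext = tm_map(ctext, content_transformer(toupper)) }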
    text = ctext
    #CONVERT FROM CORPUS IF NEEDED
    if (cflat>0) {
        text = NULL
        for (j in 1:length(ctext)) {
            temp = ctext[[j]]$content
            if (temp!="") { text = c(text,temp) }
        }
        text = as.array(text)
    }
    if (cflat==1) {
        text = paste(text,collapse="\n")
        text = str_replace_all(text, "[\r\n]" , " ")
    }
    return(text)
}
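To see what the line filter at the top of read_web_page does without fetching a page, the sketch below applies the same rule to a small in-memory vector. The page content is invented for illustration, and the six separate grep() calls are collapsed into a single character class; this is an equivalent reformulation under that assumption, not the original code.

```r
# Hypothetical page source, invented for illustration
page = c("<html><body>",
         "Text mining in finance",
         "<p>A paragraph inside tags</p>",
         "var x = a[1];",
         "Sentiment and news analytics",
         "see docs/index for details",
         "</body></html>")

# One bracket expression for the six filtered characters;
# "]" is placed first so that it is read as a literal
has_markup = grepl("[]<>}_/]", page)
clean = page[!has_markup]
print(clean)
```

The filter removes whole lines, so visible text that shares a line with a tag (the third element here) is lost as well. That is the price of this simple approach compared with a real HTML parser.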

Example

Now apply this function and see how we can get some clean text.

url = "http://srdas.github.io/research.htm"
res = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=1,cflat=1)
print(res)
## [1] "Data Science Theories Models Algorithms and Analytics web book  work in progress Derivatives Principles and Practice 2010 Rangarajan Sundaram and Sanjiv Das McGraw Hill An IndexBased Measure of Liquidity with George Chacko and Rong Fan 2016 Matrix Metrics NetworkBased Systemic Risk Scoring 2016 of systemic risk This paper won the First Prize in the MITCFP competition 2016 for  the best paper on SIFIs systemically important financial institutions  It also won the best paper award at  Credit Spreads with Dynamic Debt with Seoyoung Kim 2015  Text and Context Language Analytics for Finance 2014 Strategic Loan Modification An OptionsBased Response to Strategic Default Options and Structured Products in Behavioral Portfolios with Meir Statman 2013  and barrier range notes in the presence of fattailed outcomes using copulas Polishing Diamonds in the Rough The Sources of Syndicated Venture Performance 2011 with Hoje Jo and Yongtae Kim  Optimization with Mental Accounts 2010 with Harry Markowitz Jonathan Accountingbased versus marketbased crosssectional models of CDS spreads  with Paul Hanouna and Atulya Sarin 2009  Hedging Credit Equity Liquidity Matters with Paul Hanouna 2009 An Integrated Model for Hybrid Securities Yahoo for Amazon Sentiment Extraction from Small Talk on the Web Common Failings How Corporate Defaults are Correlated  with Darrell Duffie Nikunj Kapadia and Leandro Saita A Clinical Study of Investor Discussion and Sentiment  with Asis MartinezJerez and Peter Tufano 2005  International Portfolio Choice with Systemic Risk The loss resulting from diminished diversification is small while Speech Signaling Risksharing and the Impact of Fee Structures on investor welfare Contrary to regulatory intuition incentive structures A DiscreteTime Approach to Noarbitrage Pricing of Credit derivatives with Rating Transitions with Viral Acharya and Rangarajan Sundaram Pricing Interest Rate Derivatives A General Approachwith George Chacko A DiscreteTime Approach to 
ArbitrageFree Pricing of Credit Derivatives  The Psychology of Financial Decision Making A Case for TheoryDriven Experimental Enquiry 1999 with Priya Raghubir Of Smiles and Smirks A Term Structure Perspective A Theory of Banking Structure 1999 with Ashish Nanda by function based upon two dimensions the degree of information asymmetry  A Theory of Optimal Timing and Selectivity  A Direct DiscreteTime Approach to PoissonGaussian Bond Option Pricing in the HeathJarrowMorton  The Central Tendency A Second Factor in Bond Yields 1998 with Silverio Foresi and Pierluigi Balduzzi   Efficiency with Costly Information A Reinterpretation of Evidence from Managed Portfolios with Edwin Elton Martin Gruber and Matt  Presented and Reprinted in the Proceedings of The  Seminar on the Analysis of Security Prices at the Center  for Research in Security   Prices  at the University of  Managing Rollover Risk with Capital Structure Covenants in Structured Finance Vehicles 2016 The Design and Risk Management of Structured Finance Vehicles 2016 Post the recent subprime financial crisis we inform the creation of safer SIVs  in structured finance and propose avenues of mitigating risks faced by senior debt through  Coming up Short Managing Underfunded Portfolios in an LDIES Framework 2014  with Seoyoung Kim and Meir Statman   Going for Broke Restructuring Distressed Debt Portfolios 2014 Digital Portfolios 2013  Options on Portfolios with HigherOrder Moments 2009 options on a multivariate system of assets calibrated to the return  Dealing with Dimension Option Pricing on Factor Trees 2009 you to price options on multiple assets in a unified fraamework Computational Modeling Correlated Default with a Forest of Binomial Trees 2007 with Basel II Correlation Related Issues 2007  Correlated Default Risk 2006 with Laurence Freed Gary Geng and Nikunj Kapadia increase as markets worsen Regime switching models are needed to explain dynamic A Simple Model for Pricing Equity Options with Markov 
Switching State Variables 2006 with Donald Aingworth and Rajeev Motwani The Firms Management of Social Interactions 2005 with D Godes D Mayzlin Y Chen S Das C Dellarocas  B Pfeieffer B Libai S Sen M Shi and P Verlegh  Financial Communities with Jacob Sisk 2005  Summer 112123 Monte Carlo Markov Chain Methods for Derivative Pricing and Risk Assessmentwith Alistair Sinclair 2005  where incomplete information about the value of an asset may be exploited to  undertake fast and accurate pricing Proof that a fully polynomial randomized  Correlated Default Processes A CriterionBased Copula Approach Special Issue on Default Risk  Private Equity Returns An Empirical Examination of the Exit of VentureBacked Companies with Murali Jagannathan and Atulya Sarin firm being financed the valuation at the time of financing and the prevailing market sentiment Helps understand the risk premium required for the Issue on Computational Methods in Economics and Finance   December 5569 Bayesian Migration in Credit Ratings Based on Probabilities of The Impact of Correlated Default Risk on Credit Portfolios with Gifford Fong and Gary Geng How Diversified are Internationally Diversified Portfolios TimeVariation in the Covariances between International Returns DiscreteTime Bond and Option Pricing for JumpDiffusion Macroeconomic Implications of Search Theory for the Labor Market Auction Theory A Summary with Applications and Evidence from the Treasury Markets 1996 with Rangarajan Sundaram A Simple Approach to Three Factor Affine Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Analytical Approximations of  the Term Structure for Jumpdiffusion Processes A Numerical Analysis 1996  Markov Chain Term Structure Models Extensions and Applications Exact Solutions for Bond and Options Prices with Systematic Jump Risk 1996 with Silverio Foresi Pricing Credit Sensitive Debt when Interest Rates Credit Ratings and Credit Spreads are Stochastic 1996  v52 161198 Did CDS 
Trading Improve the Market for Corporate Bonds 2016  with Madhu Kalimipalli and Subhankar Nayak  Big Datas Big Muscle 2016  Portfolios for Investors Who Want to Reach Their Goals While Staying on the MeanVariance Efficient Frontier 2011  with Harry Markowitz Jonathan Scheid and Meir Statman  News Analytics Framework Techniques and Metrics The Handbook of News Analytics in Finance May 2011 John Wiley  Sons UK  Random Lattices for Option Pricing Problems in Finance 2011 Implementing Option Pricing Models using Python and Cython 2010 The Finance Web Internet Information and Markets 2010  Financial Applications with Parallel R 2009  Recovery Swaps 2009 with Paul Hanouna   Recovery Rates 2009with Paul Hanouna  A Simple Model for Pricing Securities with a DebtEquity Linkage 2008 in  Credit Default Swap Spreads 2006 with Paul Hanouna  MultipleCore Processors for Finance Applications 2006  Power Laws 2005 with Jacob Sisk  Genetic Algorithms 2005 Recovery Risk 2005 Venture Capital Syndication with Hoje Jo and Yongtae Kim 2004 Technical Analysis with David Tien 2004 Liquidity and the Bond Markets with Jan Ericsson and  Madhu Kalimipalli 2003 Modern Pricing of Interest Rate Derivatives  Book Review  Contagion 2003 Hedge Funds 2003 Reprinted in  Working Papers on Hedge Funds in The World of Hedge Funds  Characteristics and  Analysis 2005 World Scientific The Internet and Investors 2003   Useful things to know about Correlated Default Risk with Gifford Fong Laurence Freed Gary Geng and Nikunj Kapadia The Regulation of Fee Structures in Mutual Funds A Theoretical Analysis  with Rangarajan Sundaram 1998 NBER WP No 6639 in the Courant Institute of Mathematical Sciences special volume on A DiscreteTime Approach to ArbitrageFree Pricing of Credit Derivatives  with Rangarajan Sundaram reprinted in  the Courant Institute of Mathematical Sciences special volume on Stochastic Mean Models of the Term Structure with Pierluigi Balduzzi Silverio Foresi and Rangarajan Sundaram  John Wiley  
Sons Inc 128161 Interest Rate Modeling with JumpDiffusion Processes  John Wiley  Sons Inc 162189 Comments on Pricing ExcessofLoss Reinsurance Contracts against Catastrophic Loss by J David Cummins C Lewis and Richard Phillips Froot Ed University of Chicago Press 1999 141145   Pricing Credit Derivatives  J Frost and JG Whittaker 101138 On the Recursive Implementation of Term Structure Models  Local Volatility and the Recovery Rate of Credit Default Swaps  with Jeroen Jansen and Frank Fabozzi Efficient Rebalancing of Taxable Portfolios with Dan Ostrov Dennis Ding Vincent Newell  The Fast and the Curious VC Drift  with Amit Bubna and Paul Hanouna  Venture Capital Communities with Amit Bubna and Nagpurnanand Prabhala                                                  "

Mood Scoring Text

Now we will take a different page of text and mood score it.

#EXAMPLE OF MOOD SCORING
library(stringr)
url = "http://srdas.github.io/bio-candid.html"
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=1,cflat=1)
print(text)
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds postgraduate degrees in Finance MPhil and PhD from New York University Computer Science MS from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad BCom in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management coeditor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the AsiaPacific region as a VicePresident at Citibank His current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over ninety articles in academic journals and has won numerous awards for research and teaching His recent book Derivatives Principles and Practice was published in May 2010  He currently also serves as a Senior Fellow at the FDIC Center for Financial Research After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research  He graduated in 1994 with a PhD from NYU and since then spent five years in Boston and now lives in San Jose California  Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course Sanjiv is now a Professor of Finance 
at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank NA in the AsiaPacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse Sanjivs research style is instilled with a distinct New York state of mind  it is chaotic diverse with minimal method to the madness He has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley Coastal living did a lot to mold Sanjiv who needs to live near the ocean  The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education  you can check out any time you like but you can never leave Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good"
text = str_replace_all(text,"nbsp"," ")
text
## [1] "Sanjiv Das is the William and Janice Terry Professor of Finance at Santa Clara Universitys Leavey School of Business He previously held faculty appointments as Associate Professor at Harvard Business School and UC Berkeley He holds postgraduate degrees in Finance MPhil and PhD from New York University Computer Science MS from UC Berkeley an MBA from the Indian Institute of Management Ahmedabad BCom in Accounting and Economics University of Bombay Sydenham College and is also a qualified Cost and Works Accountant He is a senior editor of The Journal of Investment Management coeditor of The Journal of Derivatives and The Journal of Financial Services Research and Associate Editor of other academic journals Prior to being an academic he worked in the derivatives business in the AsiaPacific region as a VicePresident at Citibank His current research interests include the modeling of default risk machine learning social networks derivatives pricing models portfolio theory and venture capital He has published over ninety articles in academic journals and has won numerous awards for research and teaching His recent book Derivatives Principles and Practice was published in May 2010  He currently also serves as a Senior Fellow at the FDIC Center for Financial Research After loafing and working in many parts of Asia but never really growing up Sanjiv moved to New York to change the world hopefully through research  He graduated in 1994 with a PhD from NYU and since then spent five years in Boston and now lives in San Jose California  Sanjiv loves animals places in the world where the mountains meet the sea riding sport motorbikes reading gadgets science fiction movies and writing cool software code When there is time available from the excitement of daily life Sanjiv writes academic papers which helps him relax Always the contrarian Sanjiv thinks that New York City is the most calming place in the world after California of course Sanjiv is now a Professor of Finance 
at Santa Clara University He came to SCU from Harvard Business School and spent a year at UC Berkeley In his past life in the unreal world Sanjiv worked at Citibank NA in the AsiaPacific region He takes great pleasure in merging his many previous lives into his current existence which is incredibly confused and diverse Sanjivs research style is instilled with a distinct New York state of mind  it is chaotic diverse with minimal method to the madness He has published articles on derivatives termstructure models mutual funds the internet portfolio choice banking models credit risk and has unpublished articles in many other areas Some years ago he took time off to get another degree in computer science at Berkeley confirming that an unchecked hobby can quickly become an obsession There he learnt about the fascinating field of Randomized Algorithms skills he now applies earnestly to his editorial work and other pursuits many of which stem from being in the epicenter of Silicon Valley Coastal living did a lot to mold Sanjiv who needs to live near the ocean  The many walks in Greenwich village convinced him that there is no such thing as a representative investor yet added many unique features to his personal utility function He learnt that it is important to open the academic door to the ivory tower and let the world in Academia is a real challenge given that he has to reconcile many more opinions than ideas He has been known to have turned down many offers from Mad magazine to publish his academic work As he often explains you never really finish your education  you can check out any time you like but you can never leave Which is why he is doomed to a lifetime in Hotel California And he believes that if this is as bad as it gets life is really pretty good"
text = unlist(strsplit(text," "))
print(text)
##   [1] "Sanjiv"         "Das"            "is"             "the"           
##   [5] "William"        "and"            "Janice"         "Terry"         
##   [9] "Professor"      "of"             "Finance"        "at"            
##  [13] "Santa"          "Clara"          "Universitys"    "Leavey"        
##  [17] "School"         "of"             "Business"       "He"            
##  [21] "previously"     "held"           "faculty"        "appointments"  
##  [25] "as"             "Associate"      "Professor"      "at"            
##  [29] "Harvard"        "Business"       "School"         "and"           
##  [33] "UC"             "Berkeley"       "He"             "holds"         
##  [37] "postgraduate"   "degrees"        "in"             "Finance"       
##  [41] "MPhil"          "and"            "PhD"            "from"          
##  [45] "New"            "York"           "University"     "Computer"      
##  [49] "Science"        "MS"             "from"           "UC"            
##  [53] "Berkeley"       "an"             "MBA"            "from"          
##  [57] "the"            "Indian"         "Institute"      "of"            
##  [61] "Management"     "Ahmedabad"      "BCom"           "in"            
##  [65] "Accounting"     "and"            "Economics"      "University"    
##  [69] "of"             "Bombay"         "Sydenham"       "College"       
##  [73] "and"            "is"             "also"           "a"             
##  [77] "qualified"      "Cost"           "and"            "Works"         
##  [81] "Accountant"     "He"             "is"             "a"             
##  [85] "senior"         "editor"         "of"             "The"           
##  [89] "Journal"        "of"             "Investment"     "Management"    
##  [93] "coeditor"       "of"             "The"            "Journal"       
##  [97] "of"             "Derivatives"    "and"            "The"           
## [101] "Journal"        "of"             "Financial"      "Services"      
## [105] "Research"       "and"            "Associate"      "Editor"        
## [109] "of"             "other"          "academic"       "journals"      
## [113] "Prior"          "to"             "being"          "an"            
## [117] "academic"       "he"             "worked"         "in"            
## [121] "the"            "derivatives"    "business"       "in"            
## [125] "the"            "AsiaPacific"    "region"         "as"            
## [129] "a"              "VicePresident"  "at"             "Citibank"      
## [133] "His"            "current"        "research"       "interests"     
## [137] "include"        "the"            "modeling"       "of"            
## [141] "default"        "risk"           "machine"        "learning"      
## [145] "social"         "networks"       "derivatives"    "pricing"       
## [149] "models"         "portfolio"      "theory"         "and"           
## [153] "venture"        "capital"        "He"             "has"           
## [157] "published"      "over"           "ninety"         "articles"      
## [161] "in"             "academic"       "journals"       "and"           
## [165] "has"            "won"            "numerous"       "awards"        
## [169] "for"            "research"       "and"            "teaching"      
## [173] "His"            "recent"         "book"           "Derivatives"   
## [177] "Principles"     "and"            "Practice"       "was"           
## [181] "published"      "in"             "May"            "2010"          
## [185] ""               "He"             "currently"      "also"          
## [189] "serves"         "as"             "a"              "Senior"        
## [193] "Fellow"         "at"             "the"            "FDIC"          
## [197] "Center"         "for"            "Financial"      "Research"      
## [201] "After"          "loafing"        "and"            "working"       
## [205] "in"             "many"           "parts"          "of"            
## [209] "Asia"           "but"            "never"          "really"        
## [213] "growing"        "up"             "Sanjiv"         "moved"         
## [217] "to"             "New"            "York"           "to"            
## [221] "change"         "the"            "world"          "hopefully"     
## [225] "through"        "research"       ""               "He"            
## [229] "graduated"      "in"             "1994"           "with"          
## [233] "a"              "PhD"            "from"           "NYU"           
## [237] "and"            "since"          "then"           "spent"         
## [241] "five"           "years"          "in"             "Boston"        
## [245] "and"            "now"            "lives"          "in"            
## [249] "San"            "Jose"           "California"     ""              
## [253] "Sanjiv"         "loves"          "animals"        "places"        
## [257] "in"             "the"            "world"          "where"         
## [261] "the"            "mountains"      "meet"           "the"           
## [265] "sea"            "riding"         "sport"          "motorbikes"    
## [269] "reading"        "gadgets"        "science"        "fiction"       
## [273] "movies"         "and"            "writing"        "cool"          
## [277] "software"       "code"           "When"           "there"         
## [281] "is"             "time"           "available"      "from"          
## [285] "the"            "excitement"     "of"             "daily"         
## [289] "life"           "Sanjiv"         "writes"         "academic"      
## [293] "papers"         "which"          "helps"          "him"           
## [297] "relax"          "Always"         "the"            "contrarian"    
## [301] "Sanjiv"         "thinks"         "that"           "New"           
## [305] "York"           "City"           "is"             "the"           
## [309] "most"           "calming"        "place"          "in"            
## [313] "the"            "world"          "after"          "California"    
## [317] "of"             "course"         "Sanjiv"         "is"            
## [321] "now"            "a"              "Professor"      "of"            
## [325] "Finance"        "at"             "Santa"          "Clara"         
## [329] "University"     "He"             "came"           "to"            
## [333] "SCU"            "from"           "Harvard"        "Business"      
## [337] "School"         "and"            "spent"          "a"             
## [341] "year"           "at"             "UC"             "Berkeley"      
## [345] "In"             "his"            "past"           "life"          
## [349] "in"             "the"            "unreal"         "world"         
## [353] "Sanjiv"         "worked"         "at"             "Citibank"      
## [357] "NA"             "in"             "the"            "AsiaPacific"   
## [361] "region"         "He"             "takes"          "great"         
## [365] "pleasure"       "in"             "merging"        "his"           
## [369] "many"           "previous"       "lives"          "into"          
## [373] "his"            "current"        "existence"      "which"         
## [377] "is"             "incredibly"     "confused"       "and"           
## [381] "diverse"        "Sanjivs"        "research"       "style"         
## [385] "is"             "instilled"      "with"           "a"             
## [389] "distinct"       "New"            "York"           "state"         
## [393] "of"             "mind"           ""               "it"            
## [397] "is"             "chaotic"        "diverse"        "with"          
## [401] "minimal"        "method"         "to"             "the"           
## [405] "madness"        "He"             "has"            "published"     
## [409] "articles"       "on"             "derivatives"    "termstructure" 
## [413] "models"         "mutual"         "funds"          "the"           
## [417] "internet"       "portfolio"      "choice"         "banking"       
## [421] "models"         "credit"         "risk"           "and"           
## [425] "has"            "unpublished"    "articles"       "in"            
## [429] "many"           "other"          "areas"          "Some"          
## [433] "years"          "ago"            "he"             "took"          
## [437] "time"           "off"            "to"             "get"           
## [441] "another"        "degree"         "in"             "computer"      
## [445] "science"        "at"             "Berkeley"       "confirming"    
## [449] "that"           "an"             "unchecked"      "hobby"         
## [453] "can"            "quickly"        "become"         "an"            
## [457] "obsession"      "There"          "he"             "learnt"        
## [461] "about"          "the"            "fascinating"    "field"         
## [465] "of"             "Randomized"     "Algorithms"     "skills"        
## [469] "he"             "now"            "applies"        "earnestly"     
## [473] "to"             "his"            "editorial"      "work"          
## [477] "and"            "other"          "pursuits"       "many"          
## [481] "of"             "which"          "stem"           "from"          
## [485] "being"          "in"             "the"            "epicenter"     
## [489] "of"             "Silicon"        "Valley"         "Coastal"       
## [493] "living"         "did"            "a"              "lot"           
## [497] "to"             "mold"           "Sanjiv"         "who"           
## [501] "needs"          "to"             "live"           "near"          
## [505] "the"            "ocean"          ""               "The"           
## [509] "many"           "walks"          "in"             "Greenwich"     
## [513] "village"        "convinced"      "him"            "that"          
## [517] "there"          "is"             "no"             "such"          
## [521] "thing"          "as"             "a"              "representative"
## [525] "investor"       "yet"            "added"          "many"          
## [529] "unique"         "features"       "to"             "his"           
## [533] "personal"       "utility"        "function"       "He"            
## [537] "learnt"         "that"           "it"             "is"            
## [541] "important"      "to"             "open"           "the"           
## [545] "academic"       "door"           "to"             "the"           
## [549] "ivory"          "tower"          "and"            "let"           
## [553] "the"            "world"          "in"             "Academia"      
## [557] "is"             "a"              "real"           "challenge"     
## [561] "given"          "that"           "he"             "has"           
## [565] "to"             "reconcile"      "many"           "more"          
## [569] "opinions"       "than"           "ideas"          "He"            
## [573] "has"            "been"           "known"          "to"            
## [577] "have"           "turned"         "down"           "many"          
## [581] "offers"         "from"           "Mad"            "magazine"      
## [585] "to"             "publish"        "his"            "academic"      
## [589] "work"           "As"             "he"             "often"         
## [593] "explains"       "you"            "never"          "really"        
## [597] "finish"         "your"           "education"      ""              
## [601] "you"            "can"            "check"          "out"           
## [605] "any"            "time"           "you"            "like"          
## [609] "but"            "you"            "can"            "never"         
## [613] "leave"          "Which"          "is"             "why"           
## [617] "he"             "is"             "doomed"         "to"            
## [621] "a"              "lifetime"       "in"             "Hotel"         
## [625] "California"     "And"            "he"             "believes"      
## [629] "that"           "if"             "this"           "is"            
## [633] "as"             "bad"            "as"             "it"            
## [637] "gets"           "life"           "is"             "really"        
## [641] "pretty"         "good"
#match() RETURNS THE POSITION IN THE WORD LIST, OR NA IF THE TOKEN IS NOT IN IT
posmatch = match(text,poswords)
numposmatch = sum(!is.na(posmatch))
negmatch = match(text,negwords)
numnegmatch = sum(!is.na(negmatch))
print(c(numposmatch,numnegmatch))
## [1] 26 16
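Note that the comparison above is case-sensitive: the word lists were lower-cased, but the page was read with ccase=0, so capitalized tokens cannot match. A minimal sketch of a case-insensitive variant follows. The function mood_score and the miniature word lists and tokens are hypothetical and invented for illustration; the counts it prints are for this toy input only and say nothing about the page scored above.

```r
# Hypothetical miniature word lists and tokens, invented for illustration
pos_demo = c("good", "great", "pleasure")
neg_demo = c("bad", "doomed", "confused")
tokens = c("Great", "pleasure", "", "but", "doomed", "and", "Bad", "good")

# Count tokens found in each list after dropping empty strings and lower-casing
mood_score = function(tokens, pos, neg) {
    tokens = tolower(tokens[tokens != ""])
    c(pos = sum(tokens %in% pos), neg = sum(tokens %in% neg))
}

score = mood_score(tokens, pos_demo, neg_demo)
print(score)
```

Using %in% gives a logical vector directly, so the counts are simple sums and no NA handling is needed.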
#FURTHER EXPLORATION OF THESE OBJECTS
print(length(text))
## [1] 642
print(posmatch)
##   [1]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [15]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [29]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [43]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [57]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [71]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [85]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
##  [99]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [113]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [127]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [141]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [155]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [169]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [183]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [197]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [211]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [225]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [239]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [253]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  994   NA   NA   NA
## [267]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [281]   NA   NA   NA   NA   NA  611   NA   NA   NA   NA   NA   NA   NA   NA
## [295]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [309]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [323]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [337]   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA   NA   NA   NA
## [351]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  761
## [365] 1144   NA   NA  800   NA   NA   NA   NA  800   NA   NA   NA   NA   NA
## [379]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  515   NA   NA   NA
## [393]   NA 1011   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [407]   NA   NA   NA   NA   NA   NA   NA 1036   NA   NA   NA   NA   NA   NA
## [421]   NA  455   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [435]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [449]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [463]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA   NA
## [477]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [491]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA  941   NA
## [505]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [519]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1571   NA   NA  800
## [533]   NA   NA   NA   NA   NA   NA   NA   NA  838   NA 1076   NA   NA   NA
## [547]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1255   NA
## [561]   NA   NA   NA   NA   NA 1266   NA   NA   NA   NA   NA   NA   NA   NA
## [575]   NA   NA  781   NA   NA   NA   NA   NA   NA   NA   NA   NA  800   NA
## [589]   NA   NA   NA   NA   NA   NA   NA   NA   NA 1645  542   NA   NA   NA
## [603]   NA   NA   NA   NA   NA  940   NA   NA   NA   NA   NA   NA   NA   NA
## [617]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA
## [631]   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA 1184  747
print(text[77])
## [1] "qualified"
print(poswords[204])
## [1] "back"
is.na(posmatch)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [45]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [67]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [111]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [122]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [144]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [155]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [166]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [177]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [188]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [199]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [210]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [221]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [232]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [243]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [254]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [265]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [276]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## [287]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [298]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [309]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [320]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [331]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [342]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [353]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [364] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## [375]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [386]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [397]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [408]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## [419]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [430]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [441]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [452]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [463]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [474] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [485]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [496]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## [507]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [518]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [529] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [540]  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [551]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## [562]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [573]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [584]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [595]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [606]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [617]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [628]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [639]  TRUE  TRUE FALSE FALSE
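
The counting step above can be written more compactly. What follows is a minimal sketch on toy inputs; the short text, poswords, and negwords vectors are hypothetical stand-ins for the tokenized document and the lexicons used above.

```r
#Toy inputs (hypothetical, for illustration only)
text = c("this", "is", "really", "pretty", "good", "not", "bad")
poswords = c("good", "pretty", "great")
negwords = c("bad", "awful")

#match() gives the position of each token in the lexicon, or NA if absent
posmatch = match(text, poswords)
numposmatch = sum(!is.na(posmatch))

#%in% gives TRUE/FALSE directly, so summing it counts the hits
numnegmatch = sum(text %in% negwords)

print(c(numposmatch, numnegmatch))
## [1] 2 1
```

Counting the non-missing entries of match(), or summing %in%, gives the same number of lexicon hits as the length-of-subset approach used above.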

Language Detection and Translation

We may be scraping web sites from many countries and need to detect the language and then translate it into English for mood scoring. The useful package textcat enables us to categorize the language.

library(textcat)
text = c("Je suis un programmeur novice.",
         "I am a programmer who is a novice.",
         "Sono un programmatore alle prime armi.",
         "Ich bin ein Anfänger Programmierer",
         "Soy un programador con errores.")

lang = textcat(text)
print(lang)
## [1] "french"  "english" "italian" "german"  "spanish"

Language Translation

And of course, once the language is detected, we may translate it into English.

library(translate)
set.key("YOUR_GOOGLE_API_KEY")   #REPLACE WITH YOUR OWN KEY
print(translate(text[1],"fr","en"))
## list()
print(translate(text[3],"it","en"))
## list()
print(translate(text[4],"de","en"))
## list()
print(translate(text[5],"es","en"))
## list()

This requires a Google API key, for which you need to set up a paid account. Without a valid key, the calls return an empty list(), as shown above.

Text Classification

  1. Machine classification is, from a layman’s point of view, nothing but learning by example. In new-fangled modern parlance, it is a technique in the field of “machine learning”.

  2. Learning by machines falls into two categories, supervised and unsupervised. When a number of explanatory \(X\) variables are used to determine some outcome \(Y\), and we train an algorithm to do this, we are performing supervised (machine) learning. The outcome \(Y\) may be a dependent variable (for example, the left hand side in a linear regression), or a classification (i.e., discrete outcome).

  3. When we only have \(X\) variables and no separate outcome variable \(Y\), we perform unsupervised learning. A common example is cluster analysis, which produces groupings of various entities based on their \(X\) variables.
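
As a small illustration of the unsupervised case on numerical data, the sketch below uses the built-in iris data. Base R's kmeans() groups the flowers using only the four \(X\) measurements, and the species label (which would be the \(Y\) in a supervised setting) is used only afterwards to inspect the groups. The choice of three clusters and the seed are assumptions made for this illustration.

```r
data(iris)
X = iris[,1:4]              #Features only; the species label is not used
set.seed(1)                 #k-means starts from random centers
km = kmeans(X, centers=3)

#Compare the discovered clusters to the labels we held back
print(table(km$cluster, iris$Species))
```

The cluster numbers are arbitrary, so the table should be read for how cleanly each species falls into a single cluster, not for which number it receives.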

Classification Algorithms

We start with a simple example on numerical data before discussing how this is to be applied to text. We first look at the Bayes classifier.

Bayes Classifier - 1

Bayes classification extends the Document-Term model with a document-term-classification model. These are the three entities in the model and we denote them as \((d,t,c)\). Assume that there are \(D\) documents to classify into \(C\) categories, and we employ a dictionary/lexicon (as the case may be) of \(T\) terms or words. Hence we have \(d_i, i = 1, ... , D\), and \(t_j, j = 1, ... , T\). And correspondingly the categories for classification are \(c_k, k = 1, ... , C\).

Bayes Classifier - 2

Suppose we are given a text corpus of stock market related documents (tweets for example), and wish to classify them into bullish (\(c_1\)), neutral (\(c_2\)), or bearish (\(c_3\)), where \(C=3\). We first need to train the Bayes classifier using a training data set, with pre-classified documents, numbering \(D\). For each term \(t\) in the lexicon, we can compute how likely it is to appear in documents in each class \(c_k\). Therefore, for each class, there is a \(T\)-sided die with each face representing a term and having a probability of coming up. These dice give the class-conditional probabilities (likelihoods) of seeing a word in each class of document. We denote these probabilities succinctly as \(p(t | c)\). For example in a bearish document, if the word “sell” comprises 10% of the words that appear, then \(p(t=\mbox{sell} | c=\mbox{bearish})=0.10\).

Bayes Classifier - 3

To ensure that a word has a non-zero probability even if it does not appear in a class in the training data, we compute the probabilities with add-one smoothing as follows:

\[ \begin{equation} p(t | c) = \frac{n(t | c) + 1}{n(c)+T} \end{equation} \]

where \(n(t | c)\) is the number of times word \(t\) appears in category \(c\), and \(n(c) = \sum_t n(t | c)\) is the total number of words in the training data in class \(c\). Note that if there are no words in the class \(c\), then each term \(t\) has probability \(1/T\).
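
A short worked example may help here. The counts below are hypothetical: a lexicon of \(T=4\) terms and the number of times each appears in the training documents of one class.

```r
#Hypothetical counts n(t|c) for a four-term lexicon in a single class
n_tc = c(buy = 3, sell = 0, hold = 1, short = 0)
T_terms = length(n_tc)      #T, the number of terms in the lexicon
n_c = sum(n_tc)             #n(c), total words observed in the class

#Smoothed probabilities p(t|c) = (n(t|c) + 1) / (n(c) + T)
p_tc = (n_tc + 1) / (n_c + T_terms)
print(p_tc)
print(sum(p_tc))
## [1] 1
```

The unseen terms "sell" and "short" each receive probability 1/8 instead of zero, and the probabilities still sum to one.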

Bayes Classifier - 4

A document \(d_i\) is a collection or set of words \(t_j\). The probability of seeing a given document in each category is given by the following multinomial probability:

\[ \begin{equation} p(d | c) = \frac{n(d)!}{n(t_1|d)! \cdot n(t_2|d)! \cdots n(t_T|d)!} \times p(t_1 | c)^{n(t_1|d)} \cdot p(t_2 | c)^{n(t_2|d)} \cdots p(t_T | c)^{n(t_T|d)} \nonumber \end{equation} \]

where \(n(d)\) is the number of words in the document, and \(n(t_j | d)\) is the number of occurrences of word \(t_j\) in the same document \(d\). These \(p(d | c)\) are the likelihoods in the Bayes classifier, built from the term probabilities \(p(t | c)\) estimated on all documents in the training data. The posterior probabilities are computed for each document in the test data as follows:

\[ \begin{equation} p(c_k | d) = \frac{p(d | c_k) p(c_k)}{\sum_{j=1}^{C} \; p(d | c_j) p(c_j)}, \quad \forall k = 1, \ldots, C \nonumber \end{equation} \]

Note that we get \(C\) posterior probabilities for document \(d\), and assign the document to class \(\arg\max_{c_k} p(c_k | d)\), i.e., the class with the highest posterior probability for the given document.
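
The classification step can be sketched numerically. In the example below, the term probabilities, the equal class probabilities \(p(c)\), and the document's term counts are all hypothetical. The multinomial likelihood, in which each term probability is raised to the number of times that term occurs in the document, is computed with base R's dmultinom().

```r
#Hypothetical p(t|c): one row per class, one column per term
p_tc = rbind(bullish = c(buy = 0.6, sell = 0.1, hold = 0.3),
             neutral = c(buy = 0.3, sell = 0.3, hold = 0.4),
             bearish = c(buy = 0.1, sell = 0.6, hold = 0.3))
p_c = c(bullish = 1/3, neutral = 1/3, bearish = 1/3)   #p(c), assumed equal
n_td = c(buy = 0, sell = 3, hold = 1)                  #Term counts n(t|d) in document d

#Multinomial likelihood p(d|c) for each class
lik = apply(p_tc, 1, function(p) dmultinom(n_td, prob = p))

#Posterior p(c|d) by Bayes' rule, then pick the most probable class
post = lik * p_c / sum(lik * p_c)
print(round(post, 4))
print(names(which.max(post)))
## [1] "bearish"
```

A document dominated by "sell" is assigned to the bearish class, as expected. With many terms the likelihoods become very small, so in practice one would usually work with log probabilities (dmultinom() accepts log=TRUE).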

Naive Bayes in R

We use the e1071 package. It has a one-line command that takes in the tagged training dataset using the function naiveBayes(). It returns the trained classifier model.

The trained classifier contains the unconditional probabilities \(p(c)\) of each class, which are merely the frequencies with which each class appears in the training data. It also shows the conditional probability distributions \(p(t |c)\), given as the mean and standard deviation of the occurrence of these terms (features) in each class. We may take this trained model and re-apply it to the training data set to see how well it does. We use the predict() function for this. The data set here is the classic Iris data.

For text mining, the feature set in the data will be the set of all words, and there will be one column for each word. Hence, this will be a large feature set. To keep it small, we may instead use only a lexicon’s words as the set of features. This vastly reduces the feature set used in the classifier and makes it more specific.
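
To make the lexicon-based feature set concrete, here is a minimal sketch that turns a few documents into a matrix with one row per document and one column per lexicon word. The documents and the four-word lexicon are hypothetical.

```r
#Hypothetical documents and lexicon
docs = c("buy buy strong rally", "sell weak drop",
         "hold steady", "sell sell crash weak")
lexicon = c("buy", "sell", "strong", "weak")

#Split each document into words, then count only the lexicon words
tokens = strsplit(docs, " ")
X = t(sapply(tokens, function(w) table(factor(w, levels = lexicon))))
print(X)
```

Words outside the lexicon ("rally", "drop", and so on) are simply dropped, so the matrix has four columns regardless of how large the vocabulary is. A matrix of this form, together with a vector of class tags, is the kind of input that naiveBayes() takes.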

Example

library(e1071)
data(iris)
print(head(iris))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
tail(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica
#NAIVE BAYES
res = naiveBayes(iris[,1:4],iris[,5])
#SHOWS THE PRIOR AND LIKELIHOOD FUNCTIONS
res
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = iris[, 1:4], y = iris[, 5])
## 
## A-priori probabilities:
## iris[, 5]
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Conditional probabilities:
##             Sepal.Length
## iris[, 5]     [,1]      [,2]
##   setosa     5.006 0.3524897
##   versicolor 5.936 0.5161711
##   virginica  6.588 0.6358796
## 
##             Sepal.Width
## iris[, 5]     [,1]      [,2]
##   setosa     3.428 0.3790644
##   versicolor 2.770 0.3137983
##   virginica  2.974 0.3224966
## 
##             Petal.Length
## iris[, 5]     [,1]      [,2]
##   setosa     1.462 0.1736640
##   versicolor 4.260 0.4699110
##   virginica  5.552 0.5518947
## 
##             Petal.Width
## iris[, 5]     [,1]      [,2]
##   setosa     0.246 0.1053856
##   versicolor 1.326 0.1977527
##   virginica  2.026 0.2746501
#SHOWS POSTERIOR PROBABILITIES
predict(res,iris[,1:4],type="raw")
##               setosa   versicolor    virginica
##   [1,]  1.000000e+00 2.981309e-18 2.152373e-25
##   [2,]  1.000000e+00 3.169312e-17 6.938030e-25
##   [3,]  1.000000e+00 2.367113e-18 7.240956e-26
##   [4,]  1.000000e+00 3.069606e-17 8.690636e-25
##   [5,]  1.000000e+00 1.017337e-18 8.885794e-26
##   [6,]  1.000000e+00 2.717732e-14 4.344285e-21
##   [7,]  1.000000e+00 2.321639e-17 7.988271e-25
##   [8,]  1.000000e+00 1.390751e-17 8.166995e-25
##   [9,]  1.000000e+00 1.990156e-17 3.606469e-25
##  [10,]  1.000000e+00 7.378931e-18 3.615492e-25
##  [11,]  1.000000e+00 9.396089e-18 1.474623e-24
##  [12,]  1.000000e+00 3.461964e-17 2.093627e-24
##  [13,]  1.000000e+00 2.804520e-18 1.010192e-25
##  [14,]  1.000000e+00 1.799033e-19 6.060578e-27
##  [15,]  1.000000e+00 5.533879e-19 2.485033e-25
##  [16,]  1.000000e+00 6.273863e-17 4.509864e-23
##  [17,]  1.000000e+00 1.106658e-16 1.282419e-23
##  [18,]  1.000000e+00 4.841773e-17 2.350011e-24
##  [19,]  1.000000e+00 1.126175e-14 2.567180e-21
##  [20,]  1.000000e+00 1.808513e-17 1.963924e-24
##  [21,]  1.000000e+00 2.178382e-15 2.013989e-22
##  [22,]  1.000000e+00 1.210057e-15 7.788592e-23
##  [23,]  1.000000e+00 4.535220e-20 3.130074e-27
##  [24,]  1.000000e+00 3.147327e-11 8.175305e-19
##  [25,]  1.000000e+00 1.838507e-14 1.553757e-21
##  [26,]  1.000000e+00 6.873990e-16 1.830374e-23
##  [27,]  1.000000e+00 3.192598e-14 1.045146e-21
##  [28,]  1.000000e+00 1.542562e-17 1.274394e-24
##  [29,]  1.000000e+00 8.833285e-18 5.368077e-25
##  [30,]  1.000000e+00 9.557935e-17 3.652571e-24
##  [31,]  1.000000e+00 2.166837e-16 6.730536e-24
##  [32,]  1.000000e+00 3.940500e-14 1.546678e-21
##  [33,]  1.000000e+00 1.609092e-20 1.013278e-26
##  [34,]  1.000000e+00 7.222217e-20 4.261853e-26
##  [35,]  1.000000e+00 6.289348e-17 1.831694e-24
##  [36,]  1.000000e+00 2.850926e-18 8.874002e-26
##  [37,]  1.000000e+00 7.746279e-18 7.235628e-25
##  [38,]  1.000000e+00 8.623934e-20 1.223633e-26
##  [39,]  1.000000e+00 4.612936e-18 9.655450e-26
##  [40,]  1.000000e+00 2.009325e-17 1.237755e-24
##  [41,]  1.000000e+00 1.300634e-17 5.657689e-25
##  [42,]  1.000000e+00 1.577617e-15 5.717219e-24
##  [43,]  1.000000e+00 1.494911e-18 4.800333e-26
##  [44,]  1.000000e+00 1.076475e-10 3.721344e-18
##  [45,]  1.000000e+00 1.357569e-12 1.708326e-19
##  [46,]  1.000000e+00 3.882113e-16 5.587814e-24
##  [47,]  1.000000e+00 5.086735e-18 8.960156e-25
##  [48,]  1.000000e+00 5.012793e-18 1.636566e-25
##  [49,]  1.000000e+00 5.717245e-18 8.231337e-25
##  [50,]  1.000000e+00 7.713456e-18 3.349997e-25
##  [51,] 4.893048e-107 8.018653e-01 1.981347e-01
##  [52,] 7.920550e-100 9.429283e-01 5.707168e-02
##  [53,] 5.494369e-121 4.606254e-01 5.393746e-01
##  [54,]  1.129435e-69 9.999621e-01 3.789964e-05
##  [55,] 1.473329e-105 9.503408e-01 4.965916e-02
##  [56,]  1.931184e-89 9.990013e-01 9.986538e-04
##  [57,] 4.539099e-113 6.592515e-01 3.407485e-01
##  [58,]  2.549753e-34 9.999997e-01 3.119517e-07
##  [59,]  6.562814e-97 9.895385e-01 1.046153e-02
##  [60,]  5.000210e-69 9.998928e-01 1.071638e-04
##  [61,]  7.354548e-41 9.999997e-01 3.143915e-07
##  [62,]  4.799134e-86 9.958564e-01 4.143617e-03
##  [63,]  4.631287e-60 9.999925e-01 7.541274e-06
##  [64,] 1.052252e-103 9.850868e-01 1.491324e-02
##  [65,]  4.789799e-55 9.999700e-01 2.999393e-05
##  [66,]  1.514706e-92 9.787587e-01 2.124125e-02
##  [67,]  1.338348e-97 9.899311e-01 1.006893e-02
##  [68,]  2.026115e-62 9.999799e-01 2.007314e-05
##  [69,] 6.547473e-101 9.941996e-01 5.800427e-03
##  [70,]  3.016276e-58 9.999913e-01 8.739959e-06
##  [71,] 1.053341e-127 1.609361e-01 8.390639e-01
##  [72,]  1.248202e-70 9.997743e-01 2.256698e-04
##  [73,] 3.294753e-119 9.245812e-01 7.541876e-02
##  [74,]  1.314175e-95 9.979398e-01 2.060233e-03
##  [75,]  3.003117e-83 9.982736e-01 1.726437e-03
##  [76,]  2.536747e-92 9.865372e-01 1.346281e-02
##  [77,] 1.558909e-111 9.102260e-01 8.977398e-02
##  [78,] 7.014282e-136 7.989607e-02 9.201039e-01
##  [79,]  5.034528e-99 9.854957e-01 1.450433e-02
##  [80,]  1.439052e-41 9.999984e-01 1.601574e-06
##  [81,]  1.251567e-54 9.999955e-01 4.500139e-06
##  [82,]  8.769539e-48 9.999983e-01 1.742560e-06
##  [83,]  3.447181e-62 9.999664e-01 3.361987e-05
##  [84,] 1.087302e-132 6.134355e-01 3.865645e-01
##  [85,]  4.119852e-97 9.918297e-01 8.170260e-03
##  [86,] 1.140835e-102 8.734107e-01 1.265893e-01
##  [87,] 2.247339e-110 7.971795e-01 2.028205e-01
##  [88,]  4.870630e-88 9.992978e-01 7.022084e-04
##  [89,]  2.028672e-72 9.997620e-01 2.379898e-04
##  [90,]  2.227900e-69 9.999461e-01 5.390514e-05
##  [91,]  5.110709e-81 9.998510e-01 1.489819e-04
##  [92,]  5.774841e-99 9.885399e-01 1.146006e-02
##  [93,]  5.146736e-66 9.999591e-01 4.089540e-05
##  [94,]  1.332816e-34 9.999997e-01 2.716264e-07
##  [95,]  6.094144e-77 9.998034e-01 1.966331e-04
##  [96,]  1.424276e-72 9.998236e-01 1.764463e-04
##  [97,]  8.302641e-77 9.996692e-01 3.307548e-04
##  [98,]  1.835520e-82 9.988601e-01 1.139915e-03
##  [99,]  5.710350e-30 9.999997e-01 3.094739e-07
## [100,]  3.996459e-73 9.998204e-01 1.795726e-04
## [101,] 3.993755e-249 1.031032e-10 1.000000e+00
## [102,] 1.228659e-149 2.724406e-02 9.727559e-01
## [103,] 2.460661e-216 2.327488e-07 9.999998e-01
## [104,] 2.864831e-173 2.290954e-03 9.977090e-01
## [105,] 8.299884e-214 3.175384e-07 9.999997e-01
## [106,] 1.371182e-267 3.807455e-10 1.000000e+00
## [107,] 3.444090e-107 9.719885e-01 2.801154e-02
## [108,] 3.741929e-224 1.782047e-06 9.999982e-01
## [109,] 5.564644e-188 5.823191e-04 9.994177e-01
## [110,] 2.052443e-260 2.461662e-12 1.000000e+00
## [111,] 8.669405e-159 4.895235e-04 9.995105e-01
## [112,] 4.220200e-163 3.168643e-03 9.968314e-01
## [113,] 4.360059e-190 6.230821e-06 9.999938e-01
## [114,] 6.142256e-151 1.423414e-02 9.857659e-01
## [115,] 2.201426e-186 1.393247e-06 9.999986e-01
## [116,] 2.949945e-191 6.128385e-07 9.999994e-01
## [117,] 2.909076e-168 2.152843e-03 9.978472e-01
## [118,] 1.347608e-281 2.872996e-12 1.000000e+00
## [119,] 2.786402e-306 1.151469e-12 1.000000e+00
## [120,] 2.082510e-123 9.561626e-01 4.383739e-02
## [121,] 2.194169e-217 1.712166e-08 1.000000e+00
## [122,] 3.325791e-145 1.518718e-02 9.848128e-01
## [123,] 6.251357e-269 1.170872e-09 1.000000e+00
## [124,] 4.415135e-135 1.360432e-01 8.639568e-01
## [125,] 6.315716e-201 1.300512e-06 9.999987e-01
## [126,] 5.257347e-203 9.507989e-06 9.999905e-01
## [127,] 1.476391e-129 2.067703e-01 7.932297e-01
## [128,] 8.772841e-134 1.130589e-01 8.869411e-01
## [129,] 5.230800e-194 1.395719e-05 9.999860e-01
## [130,] 7.014892e-179 8.232518e-04 9.991767e-01
## [131,] 6.306820e-218 1.214497e-06 9.999988e-01
## [132,] 2.539020e-247 4.668891e-10 1.000000e+00
## [133,] 2.210812e-201 2.000316e-06 9.999980e-01
## [134,] 1.128613e-128 7.118948e-01 2.881052e-01
## [135,] 8.114869e-151 4.900992e-01 5.099008e-01
## [136,] 7.419068e-249 1.448050e-10 1.000000e+00
## [137,] 1.004503e-215 9.743357e-09 1.000000e+00
## [138,] 1.346716e-167 2.186989e-03 9.978130e-01
## [139,] 1.994716e-128 1.999894e-01 8.000106e-01
## [140,] 8.440466e-185 6.769126e-06 9.999932e-01
## [141,] 2.334365e-218 7.456220e-09 1.000000e+00
## [142,] 2.179139e-183 6.352663e-07 9.999994e-01
## [143,] 1.228659e-149 2.724406e-02 9.727559e-01
## [144,] 3.426814e-229 6.597015e-09 1.000000e+00
## [145,] 2.011574e-232 2.620636e-10 1.000000e+00
## [146,] 1.078519e-187 7.915543e-07 9.999992e-01
## [147,] 1.061392e-146 2.770575e-02 9.722942e-01
## [148,] 1.846900e-164 4.398402e-04 9.995602e-01
## [149,] 1.439996e-195 3.384156e-07 9.999997e-01
## [150,] 2.771480e-143 5.987903e-02 9.401210e-01
#CONFUSION MATRIX
out = table(predict(res,iris[,1:4]),iris[,5])
out
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47
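
The confusion matrix can be summarized by its accuracy, i.e., the share of observations on the diagonal. The sketch below re-enters the counts displayed above by hand so that it runs on its own; the dimension names pred and actual are labels added for this illustration.

```r
#Counts copied from the confusion matrix above (rows = predicted, columns = actual)
species = c("setosa", "versicolor", "virginica")
out = matrix(c(50, 0, 0,
                0, 47, 3,
                0, 3, 47), nrow=3, byrow=TRUE,
             dimnames=list(pred=species, actual=species))

#Overall accuracy: correctly classified / total
accuracy = sum(diag(out))/sum(out)
print(accuracy)
## [1] 0.96

#Share of each actual class that is classified correctly
print(diag(out)/colSums(out))
```

Since the model is evaluated on the same data it was trained on, this is an in-sample figure and is likely to overstate out-of-sample performance.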

Support Vector Machines (SVM) - 1

The goal of the SVM is to map a set of entities with inputs \(X=\{x_1,x_2,\ldots,x_n\}\) of dimension \(n\), i.e., \(X \in R^n\), into a set of categories \(Y=\{y_1,y_2,\ldots,y_m\}\) of dimension \(m\), such that the \(n\)-dimensional \(X\)-space is divided using hyperplanes, which result in the maximal separation between classes \(Y\). A hyperplane is the set of points \({\bf x}\) satisfying the equation

\[ {\bf w} \cdot {\bf x} = b \]

where \(b\) is a scalar constant, and \({\bf w} \in R^n\) is the normal vector to the hyperplane, i.e., the vector at right angles to the plane. The distance between this hyperplane and \({\bf w} \cdot {\bf x} = 0\) is given by \(b/||{\bf w}||\), where \(||{\bf w}||\) is the norm of vector \({\bf w}\).

SVM - 2

This set up is sufficient to provide intuition about how the SVM is implemented. Suppose we have two categories of data, i.e., \(y = \{y_1, y_2\}\). Assume that all points in category \(y_1\) lie above a hyperplane \({\bf w} \cdot {\bf x} = b_1\), and all points in category \(y_2\) lie below a hyperplane \({\bf w} \cdot {\bf x} = b_2\), then the distance between the two hyperplanes is \(\frac{|b_1-b_2|}{||{\bf w}||}\).

#Example of hyperplane geometry
w1 = 1; w2 = 2
b1 = 10
#Plot hyperplane in x1, x2 space
x1 = seq(-3,3,0.1)
x2 = (b1-w1*x1)/w2
plot(x1,x2,type="l")
#Create hyperplane 2
b2 = 8
x2 = (b2-w1*x1)/w2
lines(x1,x2,col="red")

#Compute distance to hyperplane 2
print(abs(b1-b2)/sqrt(w1^2+w2^2))
## [1] 0.8944272

We see that this gives the perpendicular distance between the two parallel hyperplanes.
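
As a check on this formula, the sketch below takes an arbitrary point on the first hyperplane and computes its perpendicular distance to the second one directly. The particular point chosen is an assumption; any point satisfying \({\bf w} \cdot {\bf x} = b_1\) gives the same answer.

```r
w = c(1, 2); b1 = 10; b2 = 8

#A point on hyperplane 1: set x1 = 0 and solve w . x = b1 for x2
x_on_h1 = c(0, b1/w[2])

#Perpendicular distance from that point to hyperplane 2
d = abs(sum(w*x_on_h1) - b2)/sqrt(sum(w^2))
print(d)
## [1] 0.8944272
```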

The goal of the SVM is to maximize the distance (separation) between the two hyperplanes, and this is achieved by minimizing the norm \(||{\bf w}||\), or equivalently its square. This naturally leads to a quadratic optimization problem.

\[ \begin{equation} \min_{b_1,b_2,{\bf w}} \frac{1}{2} ||{\bf w}||^2 \end{equation} \]

subject to \({\bf w} \cdot {\bf x} \geq b_1\) for points in category \(y_1\) and \({\bf w} \cdot {\bf x} \leq b_2\) for points in category \(y_2\). Note that the solution is determined only by the observations that lie on the two bounding hyperplanes; these are the “support” vectors that separate the two groups. The “half” in front of the minimand is for mathematical convenience in solving the quadratic program.

SVM - 3

Of course, there may be no linear hyperplane that perfectly separates the two groups. This slippage may be accounted for in the SVM by allowing for points on the wrong side of the separating hyperplanes using cost functions, i.e., we modify the quadratic program as follows:

\[ \begin{equation} \min_{b_1,b_2,{\bf w},\{\eta_i\}} \frac{1}{2} ||{\bf w}||^2 + C_1 \sum_{i \in y_1} \eta_i + C_2 \sum_{i \in y_2} \eta_i \end{equation} \]

where \(C_1,C_2\) are the costs for slippage in groups 1 and 2, respectively. Often implementations assume \(C_1=C_2\). The values \(\eta_i\) are positive for observations that are not perfectly separated, i.e., lead to slippage. Thus, for group 1, \(\eta_i\) is the amount by which observation \(i\) lies below the hyperplane \({\bf w} \cdot {\bf x} = b_1\), i.e., the observation lies on the hyperplane \({\bf w} \cdot {\bf x} = b_1 - \eta_i\). For group 2, it is the amount by which observation \(i\) lies above the hyperplane \({\bf w} \cdot {\bf x} = b_2\), i.e., the observation lies on the hyperplane \({\bf w} \cdot {\bf x} = b_2 + \eta_i\). For observations on the correct side of their respective hyperplanes, of course, \(\eta_i=0\).
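
The slippage terms are easy to compute once \({\bf w}\) and \(b_1\) are given. The sketch below does this for three hypothetical group 1 observations, reusing the \({\bf w}\) and \(b_1\) from the hyperplane example; the cost \(C_1 = 1\) is also an assumption.

```r
w = c(1, 2); b1 = 10
#Hypothetical group 1 observations, one per row
X1 = rbind(c(4, 4), c(2, 5), c(3, 3))

#Value of w . x for each observation
score = as.vector(X1 %*% w)

#Slippage: zero when w . x >= b1, otherwise the shortfall b1 - w . x
eta = pmax(0, b1 - score)
print(eta)
## [1] 0 0 1

#Contribution of group 1 to the penalty term in the objective
C1 = 1
print(C1*sum(eta))
## [1] 1
```

Only the third observation, with \({\bf w} \cdot {\bf x} = 9 < b_1\), falls on the wrong side and contributes to the penalty.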

Example of SVM with Confusion Matrix

library(e1071)

#EXAMPLE 1 for SVM
model = svm(iris[,1:4],iris[,5])
model
## 
## Call:
## svm.default(x = iris[, 1:4], y = iris[, 5])
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  51
out = predict(model,iris[,1:4])
out
##          1          2          3          4          5          6 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##          7          8          9         10         11         12 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         13         14         15         16         17         18 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         19         20         21         22         23         24 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         25         26         27         28         29         30 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         31         32         33         34         35         36 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         37         38         39         40         41         42 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         43         44         45         46         47         48 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         49         50         51         52         53         54 
##     setosa     setosa versicolor versicolor versicolor versicolor 
##         55         56         57         58         59         60 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         61         62         63         64         65         66 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         67         68         69         70         71         72 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         73         74         75         76         77         78 
## versicolor versicolor versicolor versicolor versicolor  virginica 
##         79         80         81         82         83         84 
## versicolor versicolor versicolor versicolor versicolor  virginica 
##         85         86         87         88         89         90 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         91         92         93         94         95         96 
## versicolor versicolor versicolor versicolor versicolor versicolor 
##         97         98         99        100        101        102 
## versicolor versicolor versicolor versicolor  virginica  virginica 
##        103        104        105        106        107        108 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        109        110        111        112        113        114 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        115        116        117        118        119        120 
##  virginica  virginica  virginica  virginica  virginica versicolor 
##        121        122        123        124        125        126 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        127        128        129        130        131        132 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        133        134        135        136        137        138 
##  virginica versicolor  virginica  virginica  virginica  virginica 
##        139        140        141        142        143        144 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        145        146        147        148        149        150 
##  virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
print(length(out))
## [1] 150
table(matrix(out),iris[,5])
##             
##              setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         2
##   virginica       0          2        48

So the SVM does marginally better than naive Bayes in-sample (146 versus 144 of the 150 observations correctly classified). Here is another example.

Another example

#EXAMPLE 2 for SVM
train_data = matrix(rpois(60,3),10,6)
print(train_data)
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]    6    1    6    3    5    4
##  [2,]    2    2    2    3    2    4
##  [3,]    3    0    4    2    4    5
##  [4,]    3    3    4    4    1    3
##  [5,]    4    7    6    3    4    0
##  [6,]    6    1    1    2    4    2
##  [7,]    1    5    4    3    3    6
##  [8,]    4    1    3    5    2    3
##  [9,]    3    3    4    4    3    4
## [10,]    1    4    4    4    6    6
train_class = as.matrix(c(2,3,1,2,2,1,3,2,3,3))
print(train_class)
##       [,1]
##  [1,]    2
##  [2,]    3
##  [3,]    1
##  [4,]    2
##  [5,]    2
##  [6,]    1
##  [7,]    3
##  [8,]    2
##  [9,]    3
## [10,]    3
library(e1071)
model = svm(train_data,train_class)
model
## 
## Call:
## svm.default(x = train_data, y = train_class)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1666667 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  9
pred = predict(model,train_data, type="raw")
table(pred,train_class)
##                   train_class
## pred               1 2 3
##   1.29176440936163 1 0 0
##   1.68430545397683 1 0 0
##   1.9381669826315  0 1 0
##   2.07882376164611 0 1 0
##   2.07885763288189 0 1 0
##   2.12319491068171 0 1 0
##   2.60099597219594 0 0 1
##   2.66874438847281 0 0 1
##   2.921095532933   0 0 1
##   2.92122308190957 0 0 1
train_fitted = round(pred,0)
print(cbind(train_class,train_fitted))
##      train_fitted
## 1  2            2
## 2  3            3
## 3  1            2
## 4  2            2
## 5  2            2
## 6  1            1
## 7  3            3
## 8  2            2
## 9  3            3
## 10 3            3
train_fitted = matrix(train_fitted)
table(train_class,train_fitted)
##            train_fitted
## train_class 1 2 3
##           1 1 1 0
##           2 0 4 0
##           3 0 0 4
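
Note that the model summary above reports SVM-Type: eps-regression. Because train_class is numeric, svm() fits a regression, which is why the fitted values had to be rounded. A minimal sketch of the alternative, assuming the same e1071 package and re-entering by hand the training data printed above, passes the class as a factor so that predict() returns class labels directly. The names train_class_f, model_c and pred_class are introduced here only for illustration.

```r
library(e1071)

# Training data re-entered from the printout above (10 documents, 6 features)
train_data = matrix(c(6,1,6,3,5,4,
                      2,2,2,3,2,4,
                      3,0,4,2,4,5,
                      3,3,4,4,1,3,
                      4,7,6,3,4,0,
                      6,1,1,2,4,2,
                      1,5,4,3,3,6,
                      4,1,3,5,2,3,
                      3,3,4,4,3,4,
                      1,4,4,4,6,6), 10, 6, byrow=TRUE)

# A factor response makes svm() default to C-classification
train_class_f = factor(c(2,3,1,2,2,1,3,2,3,3))
model_c = svm(train_data, train_class_f)
print(model_c$type)   # 0 is e1071's internal code for C-classification

# Predictions are class labels, so no rounding is needed
pred_class = predict(model_c, train_data)
print(table(pred_class, train_class_f))
```

The resulting table can be read as a confusion matrix in the same way as the rounded version above.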

Statistical Significance of the Confusion Matrix

How do we know if the confusion matrix shows statistically significant classification power? We do a chi-square test.

library(e1071)
res = naiveBayes(iris[,1:4],iris[,5])
pred = predict(res,iris[,1:4])
out = table(pred,iris[,5])
out
##             
## pred         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47
chisq.test(out)
## 
##  Pearson's Chi-squared test
## 
## data:  out
## X-squared = 266.16, df = 4, p-value < 2.2e-16

Word count classifiers, adjectives, and adverbs

  1. Given a lexicon of selected words, one may sign the words as positive or negative, and then do a simple word count to compute the net sentiment or mood of the text. By establishing appropriate cutoffs, one can classify text as optimistic, neutral, or pessimistic. These cutoffs are determined using the training and testing data sets.

  2. Word count classifiers may be enhanced by focusing on “emphasis words” such as adjectives and adverbs, especially when classifying emotive content. One approach used in Das and Chen (2007) is to identify all adjectives and adverbs in the text and then consider only words that lie within \(\pm 3\) words of an adjective or adverb. This extracts only the most emphatic parts of the text, which are then mood-scored.
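
The word count idea in item 1 can be sketched in a few lines of base R. This is only an illustration: the tiny lexicon, the function name score_text, and the cutoff of 1 are all hypothetical placeholders, whereas in practice the lexicon would be curated and the cutoffs chosen from training and testing data.

```r
# Hypothetical toy lexicon; a real application would use a curated word list
pos_lexicon = c("good","gain","strong","bullish")
neg_lexicon = c("bad","loss","weak","bearish")

# Net sentiment = (# positive words) - (# negative words), then apply a cutoff
score_text = function(txt, cutoff=1) {
  words = unlist(strsplit(tolower(txt), "[^a-z]+"))
  words = words[words != ""]
  net = sum(words %in% pos_lexicon) - sum(words %in% neg_lexicon)
  label = if (net >= cutoff) "optimistic" else if (net <= -cutoff) "pessimistic" else "neutral"
  list(net=net, label=label)
}

print(score_text("Strong gain this quarter, good outlook"))
print(score_text("Weak sales and a loss"))
```

Raising the cutoff widens the neutral band, which trades off fewer false signals against fewer classified messages.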

Fisher’s discriminant

\[ \begin{equation} F(w) = \frac{\frac{1}{K} \sum_{j=1}^K ({\bar w}_j - {\bar w}_0)^2}{\frac{1}{K} \sum_{j=1}^K \sigma_j^2} \nonumber \end{equation} \]

where \(K\) is the number of categories, \({\bar w}_j\) is the mean occurrence of the word \(w\) in each text in category \(j\), \({\bar w}_0\) is the mean occurrence across all categories, and \(\sigma_j^2\) is the variance of the word occurrence in category \(j\). A word with a high \(F(w)\) varies much more across categories than within them, and so discriminates well between categories. This is just one way in which Fisher’s discriminant may be calculated, and there are other variations on the theme.
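
As a rough sketch of the formula above, the function below computes \(F(w)\) for a single word from its count in each text and the category label of each text. Two choices are assumptions made here for illustration: \({\bar w}_0\) is taken as the grand mean over all texts, and \(\sigma_j^2\) as the sample variance within category \(j\). The function name fisher_discriminant and the toy data are hypothetical.

```r
# F(w) for one word: 'counts' holds the word's occurrences in each text,
# 'category' holds the category label of each text
fisher_discriminant = function(counts, category) {
  w_bar_j = tapply(counts, category, mean)  # mean occurrence within each category
  sigma2_j = tapply(counts, category, var)  # within-category (sample) variance
  w_bar_0 = mean(counts)                    # assumed: grand mean over all texts
  mean((w_bar_j - w_bar_0)^2) / mean(sigma2_j)
}

# Toy data: the word appears often in bullish texts and rarely in bearish ones
counts = c(3,4,5, 0,1,2)
category = c("bullish","bullish","bullish","bearish","bearish","bearish")
print(fisher_discriminant(counts, category))
```

Ranking all words in the lexicon by this value is one way to pick the most discriminating words.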

Vector-Distance Classifier

Suppose we have 500 documents in each of two categories, bullish and bearish. These 1,000 documents may all be placed as points in \(n\)-dimensional space. It is more than likely that the points in each category will lie closer to each other than to the points in the other category. Now, if we wish to classify a new document, with vector \(D_i\), the obvious idea is to look at which cluster it is closest to, or which point in either cluster it is closest to. The closeness between two documents \(i\) and \(j\) is determined easily by the well known metric of cosine distance, i.e.,

\[ \begin{equation} 1 - \cos(\theta_{ij}) = 1 - \frac{D_i^\top D_j}{||D_i|| \cdot ||D_j||} \nonumber \end{equation} \]

where \(||D_i|| = \sqrt{D_i^\top D_i}\) is the norm of the vector \(D_i\). The cosine of the angle between the two document vectors is 1 if the two vectors are identical, and in this case the distance between them would be zero.
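
The cosine distance above, and the nearest-point version of the classifier, can be sketched as follows. The function names and the three-term toy document vectors are hypothetical, and each row of docs is assumed to be a term-count vector for one labeled document.

```r
# Cosine distance between two document vectors
cosine_distance = function(a, b) {
  1 - sum(a*b) / (sqrt(sum(a*a)) * sqrt(sum(b*b)))
}

# Assign the label of the closest training document (nearest point in either cluster)
classify_nearest = function(new_doc, docs, labels) {
  d = apply(docs, 1, cosine_distance, b=new_doc)
  labels[which.min(d)]
}

# Toy term-count vectors for two labeled documents
docs = rbind(c(3,0,1), c(0,4,1))
labels = c("bullish","bearish")
print(classify_nearest(c(2,0,0), docs, labels))
```

Comparing against the mean vector of each category, rather than against every document, would give the closest-cluster variant mentioned above.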

Metrics: Confusion matrix

The confusion matrix is the classic tool for assessing classification accuracy. Given \(n\) categories, the matrix is of dimension \(n \times n\). The rows relate to the category assigned by the analytic algorithm and the columns refer to the correct category in which the text resides. Each cell \((i,j)\) of the matrix contains the number of text messages that were of type \(j\) and were classified as type \(i\). The cells on the diagonal of the confusion matrix state the number of times the algorithm got the classification right. All other cells are instances of classification error. If an algorithm has no classification ability, then the rows and columns of the matrix will be independent of each other. Under this null hypothesis, the statistic that is examined for rejection is as follows:

\[ \chi^2[dof=(n-1)^2] = \sum_{i=1}^n \sum_{j=1}^n \frac{[A(i,j) - E(i,j)]^2}{E(i,j)} \]

where \(A(i,j)\) are the actual numbers observed in the confusion matrix, and \(E(i,j)\) are the expected numbers, assuming no classification ability under the null. If \(T(i)\) represents the total across row \(i\) of the confusion matrix, and \(T(j)\) the column total, then

\[ E(i,j) = \frac{T(i) \times T(j)}{\sum_{i=1}^n T(i)} \equiv \frac{T(i) \times T(j)}{\sum_{j=1}^n T(j)} \]

The degrees of freedom of the \(\chi^2\) statistic is \((n-1)^2\). This statistic is very easy to implement and may be applied to models for any \(n\). A highly significant statistic is evidence of classification ability.
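
To make the formulas concrete, here is a short sketch that computes the statistic by hand for the naive Bayes confusion matrix on the iris data shown earlier. It reproduces the X-squared value of 266.16 with 4 degrees of freedom that chisq.test() reported. The variable names are chosen here for illustration.

```r
# Confusion matrix from the naive Bayes example on the iris data
A = matrix(c(50,0,0, 0,47,3, 0,3,47), 3, 3)
n = nrow(A)

# Expected counts under the null: E(i,j) = T(i) * T(j) / total
E = outer(rowSums(A), colSums(A)) / sum(A)

# Chi-square statistic, degrees of freedom, and p-value
chi2 = sum((A - E)^2 / E)
dof = (n-1)^2
pval = pchisq(chi2, dof, lower.tail=FALSE)
print(c(chi2=chi2, dof=dof))
print(pval)
```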

Accuracy

Algorithm accuracy over a classification scheme is the percentage of text that is correctly classified. This may be done in-sample or out-of-sample. To compute this off the confusion matrix, we calculate

\[ \mbox{Accuracy} = \frac{ \sum_{i=1}^K O(i,i)}{\sum_{j=1}^K M(j)} = \frac{ \sum_{i=1}^K O(i,i)}{\sum_{i=1}^K M(i)} \]

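Here \(O(i,i)\) denotes the diagonal entries of the confusion matrix, and \(M(j)\) and \(M(i)\) denote its column and row totals, so both denominators equal the total number of texts classified. We should hope that accuracy is at least greater than \(1/K\), which is the accuracy level achieved on average from random guessing.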
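
As a quick illustration, accuracy can be read off the two iris confusion matrices shown earlier (naive Bayes and SVM) and compared with the \(1/K\) benchmark. The helper name accuracy and the matrix names are introduced here for the sketch.

```r
# Accuracy = sum of diagonal entries / total number of classified items
accuracy = function(cm) sum(diag(cm)) / sum(cm)

# Confusion matrices re-entered from the iris examples above
nb_cm  = matrix(c(50,0,0, 0,47,3, 0,3,47), 3, 3)
svm_cm = matrix(c(50,0,0, 0,48,2, 0,2,48), 3, 3)

print(accuracy(nb_cm))    # 144/150
print(accuracy(svm_cm))   # 146/150
print(1/nrow(nb_cm))      # benchmark from random guessing with K = 3
```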

Sentiment over Time

Stock Sentiment Correlations

Phase Lag Analysis

False Positives

  1. The percentage of false positives is a useful metric to work with. It may be calculated as a simple count or as a weighted count (by nearness of wrong category) of false classifications divided by total classifications undertaken.

  2. For example, assume that in the confusion matrix used below, category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL. The false positives would arise from mis-classifying category 1 as 3 and vice-versa. We compute the false positive rate for illustration.

  3. The false positive rate is just 1% in the example below.

Omatrix = matrix(c(22,1,0,3,44,3,1,1,25),3,3)
print((Omatrix[1,3]+Omatrix[3,1])/sum(Omatrix))
## [1] 0.01

Sentiment Error

In a 3-way classification scheme, where category 1 is BULLISH and category 3 is BEARISH, whereas category 2 is NEUTRAL, we can compute this metric as follows.

\[ \begin{equation} \mbox{Sentiment Error} = 1 - \frac{M(j=1)-M(j=3)}{M(i=1)-M(i=3)} \nonumber \end{equation} \]

In our illustrative example, we may easily calculate this metric. The classified sentiment from the algorithm was \(-4 = 23-27\), whereas it actually should have been \(-2 = 26-28\). The sentiment error is therefore \(1 - (-4)/(-2) = -1\), that is, 100% in absolute terms.

print(Omatrix)
##      [,1] [,2] [,3]
## [1,]   22    3    1
## [2,]    1   44    1
## [3,]    0    3   25
rsum = rowSums(Omatrix)
csum = colSums(Omatrix)
print(rsum)
## [1] 26 46 28
print(csum)
## [1] 23 50 27
print(1 - (csum[1]-csum[3])/(rsum[1]-rsum[3]))
## [1] -1

Disagreement

The metric uses the number of signed buys and sells in the day (based on a sentiment model) to determine how much difference of opinion there is in the market. The metric is computed as follows:

\[ \mbox{DISAG} = \left| 1 - \left| \frac{B-S}{B+S} \right| \right| \]

where \(B, S\) are the numbers of classified buys and sells. Note that DISAG is bounded between zero and one.

Using the true categories of buys (category 1 BULLISH) and sells (category 3 BEARISH) in the same example as before, we may compute disagreement. Since buys and sells are nearly evenly balanced (26 buys and 28 sells), there is little agreement and disagreement is high.

print(Omatrix)
##      [,1] [,2] [,3]
## [1,]   22    3    1
## [2,]    1   44    1
## [3,]    0    3   25
DISAG = abs(1-abs((26-28)/(26+28)))
print(DISAG)
## [1] 0.962963

Precision and Recall

The creation of the confusion matrix leads naturally to two measures that are associated with it.

Precision is the fraction of positives identified that are truly positive, and is also known as positive predictive value. It is a measure of usefulness of prediction. So if the algorithm (say) was tasked with selecting those account holders on LinkedIn who are actually looking for a job, and it identifies \(n\) such people of which only \(m\) were really looking for a job, then the precision would be \(m/n\).

Recall is the proportion of positives that are correctly identified, and is also known as sensitivity. It is a measure of how complete the prediction is. If the actual number of people looking for a job on LinkedIn was \(M\), then recall would be \(m/M\).

For example, suppose we have the following confusion matrix.

                              Actual
Predicted          Looking for Job   Not Looking   Total
Looking for Job                 10             2      12
Not Looking                      1            16      17
Total                           11            18      29

In this case precision is \(10/12\) and recall is \(10/11\). One minus precision, \(2/12\), is the fraction of predicted positives that are false positives, and so is related to Type I error. One minus recall, \(1/11\), is the fraction of actual positives that are missed, i.e., the false negative (Type II error) rate.
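
The same numbers can be computed directly from the table. The sketch below also computes the F-score, the harmonic mean of precision and recall, which is the third quantity RTextTools reports alongside them. The matrix and variable names are chosen here for illustration.

```r
# Confusion matrix from the example: rows are predicted, columns are actual
cm = matrix(c(10,1,2,16), 2, 2,
            dimnames=list(Predicted=c("Looking","Not Looking"),
                          Actual=c("Looking","Not Looking")))

TP = cm["Looking","Looking"]        # predicted looking, actually looking
FP = cm["Looking","Not Looking"]    # predicted looking, actually not
FN = cm["Not Looking","Looking"]    # predicted not, actually looking

precision = TP/(TP+FP)              # 10/12
recall = TP/(TP+FN)                 # 10/11
fscore = 2*precision*recall/(precision+recall)
print(c(precision=precision, recall=recall, fscore=fscore))
```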

Using the RTextTools package

The RTextTools package bundles several text classification algorithms behind a single interface.

library(tm)
library(RTextTools)
## Loading required package: SparseM
## Warning: package 'SparseM' was built under R version 3.2.5
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## 
## Attaching package: 'RTextTools'
## The following objects are masked from 'package:SnowballC':
## 
##     getStemLanguages, wordStem
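#Create sample text with positive and negative markers
#poswords and negwords are character vectors of positive and negative words, assumed to have been created earlier in the session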
n = 1000
npos = round(runif(n,1,25))
nneg = round(runif(n,1,25))
flag = matrix(0,n,1)
flag[which(npos>nneg)] = 1
text = NULL
for (j in 1:n) {
  res = paste(c(sample(poswords,npos[j]),sample(negwords,nneg[j])),collapse=" ")
  text = c(text,res)
}

#Text Classification
m = create_matrix(text)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25755/3681245
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency (tf)
m = create_matrix(text,weighting=weightTfIdf)
print(m)
## <<DocumentTermMatrix (documents: 1000, terms: 3707)>>
## Non-/sparse entries: 25755/3681245
## Sparsity           : 99%
## Maximal term length: 17
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
container <- create_container(m,flag,trainSize=1:(n/2), testSize=(n/2+1):n,virgin=FALSE)
#models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","SLDA","TREE","BAGGING","BOOSTING","RF"))
models <- train_models(container, algorithms=c("MAXENT","SVM","GLMNET","TREE"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)

#RESULTS
analytics@algorithm_summary # SUMMARY OF PRECISION, RECALL, F-SCORES, AND ACCURACY SORTED BY TOPIC CODE FOR EACH ALGORITHM
##   SVM_PRECISION SVM_RECALL SVM_FSCORE GLMNET_PRECISION GLMNET_RECALL
## 0           0.8       0.82       0.81             0.59          0.78
## 1           0.8       0.78       0.79             0.64          0.41
##   GLMNET_FSCORE TREE_PRECISION TREE_RECALL TREE_FSCORE
## 0          0.67           0.54        0.93        0.68
## 1          0.50           0.65        0.15        0.24
##   MAXENTROPY_PRECISION MAXENTROPY_RECALL MAXENTROPY_FSCORE
## 0                 0.81              0.79              0.80
## 1                 0.78              0.80              0.79
analytics@label_summary # SUMMARY OF LABEL (e.g. TOPIC) ACCURACY
##   NUM_MANUALLY_CODED NUM_CONSENSUS_CODED NUM_PROBABILITY_CODED
## 0                259                 392                   277
## 1                241                 108                   223
##   PCT_CONSENSUS_CODED PCT_PROBABILITY_CODED PCT_CORRECTLY_CODED_CONSENSUS
## 0           151.35135             106.94981                      91.89189
## 1            44.81328              92.53112                      36.09959
##   PCT_CORRECTLY_CODED_PROBABILITY
## 0                        79.53668
## 1                        70.53942
analytics@document_summary # RAW SUMMARY OF ALL DATA AND SCORING
##     MAXENTROPY_LABEL MAXENTROPY_PROB SVM_LABEL  SVM_PROB GLMNET_LABEL
## 1                  1       0.8579293         1 0.8402757            0
## 2                  0       0.9889092         0 0.7917711            0
## 3                  1       0.8771239         1 0.9510173            1
## 4                  1       0.7512495         1 0.5171318            0
## 5                  1       0.6274731         1 0.7079830            0
## 6                  1       0.6523223         0 0.5476286            0
## 7                  1       0.7819357         1 0.5937867            0
## 8                  0       0.9114306         0 0.8524521            0
## 9                  0       0.9481654         0 0.7368287            0
## 10                 0       0.9240892         0 0.7124702            0
## 11                 0       0.5038318         0 0.6281052            0
## 12                 1       0.7534973         1 0.5859398            1
## 13                 1       0.5959040         0 0.8325437            0
## 14                 0       0.8346023         0 0.7314251            0
## 15                 0       0.9162477         0 0.5597547            0
## 16                 0       0.9877611         0 0.9007317            0
## 17                 0       0.8269503         0 0.5219729            0
## 18                 0       0.8783712         0 0.5331234            0
## 19                 0       0.5754839         1 0.6368268            0
## 20                 1       0.6036500         1 0.6717878            0
## 21                 1       0.9125837         1 0.6700221            0
## 22                 1       0.5517872         0 0.5522943            0
## 23                 1       0.9779073         1 0.8457107            1
## 24                 0       0.5791141         0 0.5764542            0
## 25                 1       0.9896532         1 0.9452910            1
## 26                 0       0.8636291         0 0.6052753            0
## 27                 0       0.9996259         0 0.9644365            0
## 28                 0       0.7013461         0 0.7835032            1
## 29                 0       0.7532735         0 0.6470991            0
## 30                 1       0.6994149         1 0.5361563            0
## 31                 0       0.9840143         0 0.6337562            1
## 32                 1       0.9693421         1 0.8075273            1
## 33                 1       0.9773058         1 0.9323526            1
## 34                 1       0.9183425         1 0.6772988            0
## 35                 1       0.9468948         1 0.7421286            1
## 36                 0       0.9477714         0 0.8461861            0
## 37                 1       0.6455413         0 0.5373315            0
## 38                 1       0.9946571         1 0.6988711            0
## 39                 1       0.9992862         1 0.9440295            0
## 40                 0       0.9999305         0 0.9579143            0
## 41                 1       0.9961413         1 0.8555121            0
## 42                 0       0.9990200         0 0.9716997            0
## 43                 1       0.7513958         0 0.5090049            0
## 44                 0       0.9763378         0 0.7047927            1
## 45                 0       0.8208638         0 0.7682049            0
## 46                 1       0.5955903         0 0.5808784            0
## 47                 0       0.9637975         0 0.9045674            0
## 48                 1       0.9816336         1 0.8125144            1
## 49                 1       0.8275467         1 0.6484182            1
## 50                 0       0.9918852         0 0.8352544            0
## 51                 0       0.8478272         0 0.6124079            0
## 52                 1       0.6703351         1 0.5165522            0
## 53                 0       0.5111346         0 0.6634197            0
## 54                 0       0.5812117         0 0.6333654            0
## 55                 1       0.9928267         1 0.8860294            0
## 56                 1       0.7327049         1 0.5830705            0
## 57                 0       0.9730584         0 0.8657668            0
## 58                 0       0.9968959         0 0.9085251            1
## 59                 0       0.5414529         0 0.5438450            1
## 60                 1       0.8750862         1 0.5107863            0
## 61                 1       0.7745912         1 0.6563022            1
## 62                 1       0.8836069         1 0.7450693            1
## 63                 0       0.8662524         1 0.5233994            0
## 64                 1       0.9963765         1 0.8794826            1
## 65                 1       0.9211440         1 0.8087564            1
## 66                 0       0.9624085         0 0.7444767            1
## 67                 1       0.5772304         0 0.6569568            0
## 68                 0       0.9910539         0 0.9090005            0
## 69                 0       0.9783718         0 0.8322591            0
## 70                 0       0.7374812         0 0.6836114            0
## 71                 1       0.9999973         1 0.9980898            1
## 72                 1       0.8324743         1 0.5215142            0
## 73                 1       0.6593330         1 0.6342903            0
## 74                 1       0.9724665         1 0.5740967            1
## 75                 0       0.7777847         0 0.6719141            0
## 76                 0       0.9615039         0 0.8009526            0
## 77                 0       0.8134608         1 0.5497218            1
## 78                 1       0.8454406         1 0.5883797            0
## 79                 1       0.9479331         0 0.5360295            1
## 80                 1       0.9230795         1 0.8210145            0
## 81                 0       0.9203069         0 0.6967228            0
## 82                 1       0.8738333         1 0.7200714            1
## 83                 0       0.7767262         0 0.6966733            1
## 84                 1       0.6662228         1 0.6142579            0
## 85                 0       0.9558573         0 0.8713604            0
## 86                 0       0.5209858         0 0.6368313            0
## 87                 0       0.9834270         0 0.6966126            0
## 88                 0       0.5039243         1 0.5810159            1
## 89                 0       0.6592179         0 0.5473700            0
## 90                 1       0.8206025         0 0.5857905            1
## 91                 0       0.9485201         0 0.7013216            0
## 92                 0       0.9788151         0 0.7388368            0
## 93                 1       0.9157100         1 0.7152854            0
## 94                 1       0.9132725         1 0.7660061            1
## 95                 1       0.9358443         1 0.7086191            0
## 96                 1       0.5151482         0 0.5249791            0
## 97                 1       0.5463302         0 0.6979842            0
## 98                 1       0.6657931         0 0.5744932            0
## 99                 1       0.9573048         1 0.5617906            0
## 100                1       0.5914636         1 0.5875220            1
## 101                0       0.9937141         0 0.9468874            0
## 102                1       0.9889676         1 0.8175819            1
## 103                1       0.6751737         1 0.6128321            1
## 104                0       0.5129555         1 0.5417314            1
## 105                1       0.7528127         1 0.7529493            0
## 106                1       0.9946818         1 0.9455856            1
## 107                0       0.6762017         0 0.6156119            1
## 108                0       0.7763997         0 0.5316527            0
## 109                0       0.9818108         0 0.7262813            0
## 110                1       0.8674041         1 0.7825652            0
## 111                0       0.7478253         0 0.6084569            0
## 112                1       0.9836338         1 0.8543653            1
## 113                0       0.9634648         0 0.9359335            0
## 114                0       0.8182412         0 0.8653513            0
## 115                1       0.7047897         0 0.5145153            0
## 116                1       0.9200210         1 0.7607339            0
## 117                0       0.9933144         0 0.9131847            0
## 118                0       0.9025121         0 0.6184797            0
## 119                1       0.9937498         1 0.9080192            1
## 120                1       0.5402749         0 0.5658089            0
## 121                0       0.9753441         0 0.9230138            0
## 122                1       0.9706926         1 0.5247743            0
## 123                0       0.6742449         0 0.7398689            0
## 124                0       0.8742186         0 0.7326573            0
## 125                0       0.5335593         1 0.5415353            0
## 126                0       0.8910276         0 0.8034074            0
## 127                0       0.9934759         0 0.9413915            0
## 128                0       0.9770633         0 0.8111901            0
## 129                0       0.9974013         0 0.9472345            0
## 130                0       0.9927454         0 0.9288016            0
## 131                1       0.9924156         1 0.9112208            0
## 132                0       0.9762022         0 0.8702204            0
## 133                1       0.9753263         1 0.6592442            0
## 134                1       0.9999129         1 0.9328961            1
## 135                1       0.5528284         0 0.6754797            1
## 136                1       0.7963306         1 0.7210398            0
## 137                0       0.7610011         1 0.5000000            0
## 138                1       0.9552446         1 0.8322440            1
## 139                1       0.8913725         1 0.6311903            0
## 140                0       0.9564910         0 0.7348463            0
## 141                0       0.9975408         0 0.8700072            0
## 142                1       0.9429426         1 0.7539096            0
## 143                0       0.9763186         0 0.8522256            0
## 144                0       0.9504727         0 0.8119919            0
## 145                0       0.6506800         0 0.7310967            0
## 146                0       0.6387208         0 0.5352087            0
## 147                0       0.9072185         0 0.7639297            0
## 148                0       0.7050164         0 0.6584521            0
## 149                0       0.9671963         0 0.8330123            1
## 150                1       0.8606746         0 0.6428315            0
## 151                1       0.9778635         1 0.7972212            0
## 152                0       0.8921047         1 0.6049122            1
## 153                0       0.7423709         0 0.5317705            1
## 154                0       0.9402686         0 0.8672841            0
## 155                0       0.6657072         0 0.7161006            0
## 156                0       0.9992692         0 0.9448097            0
## 157                1       0.7233352         1 0.5800949            1
## 158                1       0.9985942         1 0.9708887            0
## 159                0       0.6700544         0 0.7031450            0
## 160                1       0.9872401         1 0.6853467            0
## 161                0       0.7979641         0 0.6832418            0
## 162                0       0.7472300         0 0.6975315            1
## 163                0       0.5928332         0 0.7230411            0
## 164                0       0.9837886         0 0.8250189            0
## 165                0       0.6306148         1 0.6488264            1
## 166                1       0.9104182         1 0.6929846            0
## 167                0       0.9316773         0 0.8365241            0
## 168                0       0.8405723         0 0.6611237            0
## 169                0       0.9565921         0 0.7508357            0
## 170                0       0.7110081         0 0.6531900            1
## 171                1       0.7560378         1 0.7902000            1
## 172                0       0.9035350         0 0.5478064            0
## 173                1       0.9217055         1 0.7377317            0
## 174                0       0.9724004         0 0.7529122            0
## 175                1       0.7339565         1 0.6350940            0
## 176                1       0.9478130         1 0.7114427            1
## 177                0       0.9935900         0 0.8726856            0
## 178                1       0.6289705         1 0.5480000            0
## 179                0       0.5311258         0 0.5879138            0
## 180                1       0.9925060         1 0.8921324            1
## 181                1       0.9137843         1 0.7270650            0
## 182                0       0.9519986         0 0.8810549            0
## 183                0       0.7883762         0 0.5468736            1
## 184                1       0.9081567         1 0.7568435            1
## 185                0       0.8683790         0 0.5569881            0
## 186                0       0.7008180         0 0.5625275            0
## 187                1       0.9973144         1 0.9299310            0
## 188                0       0.9519957         0 0.9115386            0
## 189                0       0.5611704         1 0.5000000            0
## 190                1       0.7220722         1 0.6971641            1
## 191                0       0.6198362         0 0.5935343            0
## 192                1       0.9846100         1 0.8471466            1
## 193                0       0.9327107         0 0.7992874            0
## 194                0       0.7101888         0 0.5334756            0
## 195                1       0.7648032         1 0.6125674            1
## 196                1       0.9854055         1 0.7351516            1
## 197                1       0.9762002         1 0.8398882            0
## 198                0       0.9979624         0 0.9370254            0
## 199                0       0.6803242         0 0.6342106            0
## 200                0       0.6705083         1 0.5242053            1
## 201                0       0.7522277         0 0.5556011            0
## 202                1       0.6863030         1 0.5176115            0
## 203                1       0.6580211         1 0.5716186            0
## 204                0       0.8634015         0 0.8077151            0
## 205                1       0.8436331         1 0.6238419            0
## 206                1       0.9683306         1 0.8070274            0
## 207                1       0.9522856         1 0.6924892            1
## 208                1       0.7720141         1 0.6598490            0
## 209                1       0.9958683         1 0.8699881            0
## 210                0       0.9862749         0 0.8041823            0
## 211                0       0.5969224         1 0.7062762            1
## 212                0       0.9336405         0 0.7558863            0
## 213                1       0.9977380         1 0.9286261            0
## 214                0       0.9860395         0 0.9325511            0
## 215                0       0.9977886         0 0.9420480            0
## 216                0       0.9401983         0 0.8076000            0
## 217                1       0.9013253         1 0.6563627            0
## 218                0       0.5187154         0 0.5842494            0
## 219                0       0.6690334         1 0.6339251            0
## 220                0       0.9456176         0 0.8697507            0
## 221                1       0.8586761         1 0.5529408            1
## 222                0       0.8862194         0 0.7476264            1
## 223                1       0.9225414         1 0.5213904            0
## 224                1       0.7841996         1 0.5323519            0
## 225                0       0.9991583         0 0.9481004            1
## 226                1       0.9304379         1 0.6965753            0
## 227                0       0.5710796         0 0.5286048            0
## 228                1       0.8308168         1 0.7026651            1
## 229                1       0.8601532         1 0.5515215            1
## 230                1       0.9674282         1 0.7099447            0
## 231                0       0.7017201         0 0.7500263            1
## 232                0       0.5223424         0 0.5729081            0
## 233                0       0.9121292         0 0.5948308            0
## 234                0       0.9299358         0 0.6050689            1
## 235                0       0.9795657         0 0.8375783            0
## 236                1       0.9461373         1 0.7772152            0
## 237                1       0.8068785         1 0.5183433            0
## 238                0       0.9939132         0 0.9481846            0
## 239                1       0.9950434         1 0.9235710            0
## 240                1       0.9115514         0 0.5080233            0
## 241                1       0.5365909         1 0.6667368            1
## 242                0       0.9988258         0 0.9395797            0
## 243                0       0.9992458         0 0.9331821            1
## 244                1       0.8956701         1 0.5742082            0
## 245                0       0.8813561         0 0.6236588            0
## 246                1       0.7937640         1 0.7213871            1
## 247                0       0.9238146         0 0.6728945            0
## 248                1       0.9996548         1 0.9581935            1
## 249                0       0.9871408         0 0.8609307            0
## 250                1       0.9993370         1 0.9383927            1
## 251                0       0.9356324         0 0.7329846            0
## 252                1       0.9840480         1 0.8597488            0
## 253                0       0.9869608         0 0.9131760            0
## 254                1       0.9982424         1 0.8869734            0
## 255                0       0.9138004         0 0.6861681            0
## 256                0       0.6858577         1 0.5622951            1
## 257                0       0.5985646         0 0.6565838            0
## 258                1       0.9710643         1 0.7977791            0
## 259                1       0.9589746         1 0.7922633            1
## 260                1       0.9549443         1 0.7806370            0
## 261                0       0.6918860         0 0.5766865            0
## 262                0       0.9985237         0 0.8982357            0
## 263                1       0.8399776         1 0.6923087            1
## 264                0       0.9915302         0 0.7551562            0
## 265                1       0.9956326         1 0.8826648            0
## 266                1       0.9547358         1 0.8752002            0
## 267                1       0.8960485         1 0.7327110            1
## 268                1       0.6222101         0 0.5769813            1
## 269                1       0.9995143         1 0.9429565            1
## 270                1       0.9027375         1 0.5546629            0
## 271                0       0.9819313         0 0.8049833            0
## 272                1       0.9201509         1 0.8614444            1
## 273                0       0.9984833         0 0.6930337            0
## 274                1       0.6100678         0 0.5414032            1
## 275                1       0.7803396         1 0.5175333            0
## 276                0       0.8704907         0 0.6726541            0
## 277                0       0.8580918         0 0.8068547            0
## 278                1       0.9967249         1 0.9028317            1
## 279                0       0.6276587         0 0.6460706            1
## 280                0       0.8532253         0 0.7217786            0
## 281                0       0.7642552         0 0.7148068            0
## 282                1       0.8410749         1 0.7090440            1
## 283                1       0.5896253         1 0.6433853            1
## 284                0       0.8554587         0 0.8267814            0
## 285                0       0.9893251         0 0.8502544            0
## 286                1       0.9871357         1 0.8239922            0
## 287                1       0.9902735         1 0.9128083            1
## 288                0       0.9985652         0 0.9326216            1
## 289                1       0.9915358         1 0.8658739            0
## 290                1       0.9997295         1 0.9252811            0
## 291                1       0.6164611         1 0.6474987            1
## 292                1       0.9053614         1 0.7346250            1
## 293                1       0.8422158         1 0.5884936            1
## 294                0       0.9114742         0 0.6671633            0
## 295                0       0.9984413         0 0.9119299            0
## 296                0       0.8707787         0 0.7631776            0
## 297                0       0.5836651         1 0.6607701            0
## 298                1       0.9860063         1 0.8541644            1
## 299                1       0.6162486         1 0.6223337            0
## 300                0       0.8305191         0 0.6640235            0
## 301                0       0.9427569         0 0.5294931            0
## 302                1       0.9609790         1 0.7429972            0
## 303                1       0.9739832         1 0.6503368            0
## 304                1       0.9471382         1 0.7350835            0
## 305                1       0.6543756         0 0.5728064            1
## 306                0       0.9122146         0 0.8051103            0
## 307                1       0.8238401         1 0.5624106            0
## 308                1       0.7811818         1 0.5548059            0
## 309                0       0.8406514         0 0.7738493            0
## 310                0       0.9457159         0 0.6870276            1
## 311                0       0.7998331         0 0.5510085            0
## 312                1       0.9992647         1 0.9398967            1
## 313                0       0.9079526         0 0.6203149            0
## 314                0       0.8368040         0 0.6272342            1
## 315                1       0.9925879         1 0.9578279            1
## 316                0       0.6228280         0 0.6482964            0
## 317                0       0.6739993         0 0.6757330            1
## 318                1       0.5705777         1 0.6939397            1
## 319                1       0.8237647         1 0.5931125            0
## 320                0       0.8342470         0 0.6562285            0
## 321                0       0.9322149         0 0.7823699            0
## 322                0       0.8560471         0 0.7969113            0
## 323                1       0.7914854         0 0.5084336            0
## 324                1       0.5482127         1 0.6354649            0
## 325                1       0.9986399         1 0.8703626            0
## 326                1       0.7073741         1 0.5423448            0
## 327                0       0.6749616         1 0.6179744            1
## 328                1       0.9690866         1 0.7431435            0
## 329                1       0.8789130         1 0.7282262            1
## 330                0       0.9014785         0 0.5646138            0
## 331                1       0.8707241         1 0.7249503            1
## 332                1       0.8597746         1 0.7923170            0
## 333                1       0.8215223         1 0.5616467            1
## 334                1       0.9961448         1 0.9181635            1
## 335                0       0.9988321         0 0.9423715            0
## 336                1       0.5135696         1 0.5000000            0
## 337                0       0.9899112         0 0.9073483            1
## 338                0       0.8359537         0 0.6753093            0
## 339                1       0.9705366         1 0.8288985            1
## 340                1       0.9943038         1 0.6951728            0
## 341                1       0.9163763         1 0.6460686            1
## 342                1       0.6998493         1 0.7422305            0
## 343                0       0.6632546         0 0.7485119            0
## 344                0       0.9609904         0 0.5989242            1
## 345                1       0.7437858         1 0.6102799            0
## 346                1       0.8466107         1 0.6968808            1
## 347                0       0.5543564         0 0.6482536            0
## 348                1       0.9804986         1 0.6492958            0
## 349                0       0.8736247         0 0.6875652            0
## 350                1       0.6728316         0 0.5712710            0
## 351                0       0.9860867         0 0.6926137            0
## 352                1       0.9974535         1 0.9727068            1
## 353                1       0.7442008         0 0.5370067            0
## 354                0       0.8777880         0 0.6573623            0
## 355                1       0.9973117         1 0.8947501            0
## 356                1       0.7015934         1 0.7585121            0
## 357                0       0.6393297         0 0.7159780            0
## 358                1       0.7595052         1 0.5139238            1
## 359                0       0.7916158         0 0.6007815            0
## 360                0       0.6808300         0 0.5993908            0
## 361                1       0.7723349         1 0.6779353            1
## 362                1       0.9894503         1 0.7380638            1
## 363                1       0.9896806         1 0.8968396            1
## 364                1       0.7789526         1 0.6853853            1
## 365                0       0.9965698         0 0.9042862            0
## 366                1       0.8788495         1 0.7651889            0
## 367                1       0.9660800         1 0.9385775            1
## 368                1       0.8342565         0 0.5669056            1
## 369                0       0.8947800         0 0.7513126            1
## 370                0       0.5025263         0 0.6584950            1
## 371                0       0.8104135         0 0.7068002            0
## 372                0       0.8218882         0 0.7668281            0
## 373                0       0.8226753         0 0.7980549            0
## 374                1       0.5052631         0 0.5439073            0
## 375                1       0.9982171         1 0.9412506            0
## 376                0       0.9540871         0 0.6988135            0
## 377                0       0.9926865         0 0.9181457            0
## 378                1       0.6010322         0 0.5373245            0
## 379                0       0.7856743         0 0.5208672            1
## 380                1       0.8718875         1 0.8180463            0
## 381                1       0.6489393         0 0.5290149            0
## 382                1       0.9894129         1 0.8094954            1
## 383                1       0.9796525         1 0.8596787            1
## 384                0       0.9979607         0 0.9386429            0
## 385                1       0.8638297         1 0.8101401            0
## 386                1       0.5546006         1 0.6919865            1
## 387                0       0.8180760         0 0.7012383            0
## 388                0       0.9876469         0 0.9284343            0
## 389                1       0.9823872         1 0.9204731            0
## 390                0       0.8905893         0 0.7539698            0
## 391                1       0.7984151         1 0.5000000            1
## 392                0       0.9665107         0 0.6980406            0
## 393                1       0.9958336         1 0.8410243            0
## 394                0       0.6541532         0 0.5512484            0
## 395                0       0.9948952         0 0.8221016            0
## 396                0       0.7942886         0 0.8116594            0
## 397                0       0.7587738         1 0.6051349            1
## 398                0       0.9983465         0 0.9285883            0
## 399                1       0.9571553         1 0.8026093            0
## 400                1       0.9419521         1 0.8227311            1
## 401                0       0.9985936         0 0.9140994            0
## 402                0       0.9997320         0 0.9181361            0
## 403                0       0.8044088         0 0.7147719            0
## 404                1       0.8025932         1 0.9296026            0
## 405                1       0.6334948         1 0.6185342            1
## 406                1       0.9719801         1 0.8781701            1
## 407                1       0.5331207         1 0.6234584            1
## 408                1       0.9990193         1 0.9470882            1
## 409                0       0.8305089         0 0.7008554            1
## 410                0       0.8066429         0 0.8059884            0
## 411                1       0.8025757         1 0.6282915            1
## 412                1       0.6822269         0 0.6266570            0
## 413                1       0.9465322         1 0.9194491            1
## 414                0       0.8347914         0 0.6620772            0
## 415                1       0.8162886         1 0.9187027            0
## 416                0       0.9984532         0 0.8820015            0
## 417                1       0.5484758         0 0.5182158            0
## 418                1       0.7904369         1 0.6055508            0
## 419                0       0.6879972         1 0.5000000            1
## 420                1       0.9981277         1 0.9228279            1
## 421                0       0.7664426         0 0.5344050            0
## 422                0       0.9950190         0 0.8106605            0
## 423                1       0.9737793         1 0.7264617            1
## 424                0       0.9785975         0 0.8043745            0
## 425                0       0.9365236         0 0.7434777            0
## 426                0       0.8905600         0 0.8009632            0
## 427                0       0.9963515         0 0.9026182            0
## 428                0       0.9676527         0 0.7838589            0
## 429                1       0.9449964         0 0.5430453            0
## 430                0       0.9994865         0 0.9541793            0
## 431                0       0.6503608         0 0.6474523            0
## 432                0       0.6319210         0 0.5158716            0
## 433                1       0.8570130         1 0.5473283            1
## 434                0       0.9999995         0 0.9771884            0
## 435                1       0.9753302         1 0.8168514            1
## 436                1       0.5729345         1 0.5208862            0
## 437                0       0.9087493         0 0.8208061            0
## 438                1       0.9892250         1 0.8500823            0
## 439                1       0.8351191         1 0.6783053            0
## 440                1       0.9987029         1 0.9556035            1
## 441                1       0.8200236         1 0.7363374            1
## 442                0       0.7801902         0 0.8580455            0
## 443                1       0.9913303         1 0.8545664            0
## 444                0       0.5413476         1 0.6780044            1
## 445                0       0.9786135         0 0.8499764            0
## 446                0       0.9779005         0 0.8622694            0
## 447                0       0.5845589         0 0.5688480            1
## 448                1       0.6262259         0 0.5478604            0
## 449                1       0.8167463         1 0.5157745            0
## 450                1       0.7324551         1 0.5611828            1
## 451                0       0.9832145         0 0.8103501            0
## 452                0       0.7652720         0 0.7460515            1
## 453                1       0.5011543         0 0.5673300            0
## 454                1       0.9669596         1 0.8856077            1
## 455                0       0.9743722         0 0.9112886            1
## 456                1       0.7581967         0 0.5583164            0
## 457                1       0.9604072         1 0.8664372            1
## 458                0       0.9798291         0 0.7953195            0
## 459                1       0.9620615         1 0.8356544            1
## 460                0       0.9930181         0 0.8928142            0
## 461                0       0.6809998         1 0.7355854            0
## 462                1       0.6170456         1 0.6238354            0
## 463                0       0.6961607         1 0.5948026            1
## 464                1       0.9388029         1 0.7135307            0
## 465                1       0.5362272         1 0.5677262            1
## 466                1       0.9577047         1 0.8808262            1
## 467                1       0.8981484         1 0.5280032            0
## 468                0       0.6333211         0 0.6085719            0
## 469                0       0.8817043         0 0.6362156            0
## 470                0       0.9238678         0 0.8846742            0
## 471                1       0.5239956         0 0.7341881            0
## 472                1       0.5646487         1 0.6400849            1
## 473                0       0.7097329         0 0.7561265            0
## 474                0       0.7986456         1 0.5145433            1
## 475                1       0.7309270         0 0.5099924            0
## 476                0       0.8997366         0 0.8586327            0
## 477                0       0.9958991         0 0.9417519            0
## 478                0       0.9480399         0 0.7875715            0
## 479                0       0.9999734         0 0.9888995            0
## 480                0       0.5034669         0 0.6431503            0
## 481                1       0.9935161         1 0.9020440            1
## 482                1       0.9875672         1 0.7640271            0
## 483                0       0.8762831         1 0.6589625            1
## 484                1       0.9650693         1 0.6594522            1
## 485                0       0.8171540         0 0.5850818            1
## 486                1       0.8879084         1 0.7838821            0
## 487                1       0.9993119         1 0.9492154            1
## 488                0       0.9934395         0 0.6948757            0
## 489                0       0.8499537         0 0.6463377            0
## 490                0       0.9394478         0 0.7994531            0
## 491                0       0.9952418         0 0.8408268            0
## 492                1       0.8882416         1 0.7178416            0
## 493                1       0.9490260         1 0.9046973            1
## 494                1       0.5261529         0 0.5234034            0
## 495                1       0.8336614         1 0.5965328            0
## 496                0       0.7470037         0 0.7062910            0
## 497                0       0.9827006         0 0.8020107            0
## 498                1       0.7805598         1 0.6247937            0
## 499                0       0.7382974         0 0.7111972            0
## 500                0       0.9515785         0 0.8849098            0
##     GLMNET_PROB TREE_LABEL TREE_PROB MANUAL_CODE CONSENSUS_CODE
## 1     0.5288133          0 0.6616541           0              0
## 2     0.8873376          0 0.6616541           0              0
## 3     0.9923295          1 1.0000000           1              1
## 4     0.8721474          0 0.6616541           0              0
## 5     0.8780568          0 0.6616541           1              0
## 6     0.8746519          0 0.6616541           1              0
## 7     0.8472148          0 0.6616541           1              0
## 8     0.7968811          0 0.6616541           0              0
## 9     0.8348502          0 0.6616541           0              0
## 10    0.8513187          0 0.6616541           0              0
## 11    0.8746533          0 0.6616541           0              0
## 12    0.6177916          1 1.0000000           0              1
## 13    0.7382742          0 0.6616541           0              0
## 14    0.7293119          0 0.6616541           0              0
## 15    0.9775675          0 0.6616541           0              0
## 16    0.9377424          0 0.6616541           0              0
## 17    0.8959542          0 0.6616541           1              0
## 18    0.6561360          0 0.6616541           0              0
## 19    0.9487429          0 0.6616541           1              0
## 20    0.6793379          0 0.6616541           1              0
## 21    0.7873755          0 0.6616541           1              0
## 22    0.7772535          0 0.6616541           0              0
## 23    0.9821741          1 1.0000000           1              1
## 24    0.6879190          0 0.6616541           1              0
## 25    0.9981822          1 1.0000000           1              1
## 26    0.8016019          0 0.6616541           0              0
## 27    0.9531810          0 0.6616541           0              0
## 28    0.8895098          0 0.6616541           0              0
## 29    0.8746534          0 0.6616541           0              0
## 30    0.8746534          0 0.6616541           0              0
## 31    0.5653546          0 0.6616541           0              0
## 32    0.9850735          0 0.6616541           1              1
## 33    0.9947636          1 1.0000000           1              1
## 34    0.5936176          0 0.6616541           1              0
## 35    0.9496439          0 0.6616541           1              1
## 36    0.7859645          0 0.6616541           0              0
## 37    0.7668888          0 0.6616541           1              0
## 38    0.5687664          0 0.6616541           1              0
## 39    0.8746534          0 0.6616541           1              0
## 40    0.7332054          0 0.6616541           0              0
## 41    0.6844681          0 0.6616541           1              0
## 42    0.8746534          0 0.6616541           0              0
## 43    0.5468797          0 0.6616541           1              0
## 44    0.9154582          0 0.6616541           0              0
## 45    0.8876478          0 0.6616541           0              0
## 46    0.8746534          0 0.6616541           1              0
## 47    0.9124036          0 0.6616541           0              0
## 48    0.9392254          0 0.6616541           1              1
## 49    0.9269598          0 0.6616541           0              1
## 50    0.8746534          0 0.6616541           0              0
## 51    0.7311568          0 0.6616541           1              0
## 52    0.6349322          0 0.6616541           0              0
## 53    0.8746534          0 0.6616541           1              0
## 54    0.5961871          0 0.6616541           1              0
## 55    0.5020932          0 0.6616541           1              0
## 56    0.9772314          0 0.6616541           1              0
## 57    0.8864575          0 0.6616541           0              0
## 58    0.9689157          0 0.6616541           0              0
## 59    0.8455672          0 0.6616541           1              0
## 60    0.9221623          0 0.6616541           0              0
## 61    0.9401425          1 1.0000000           1              1
## 62    0.7596314          1 1.0000000           1              1
## 63    0.9481777          0 0.6616541           1              0
## 64    0.8586796          0 0.6616541           1              1
## 65    0.8066477          1 1.0000000           1              1
## 66    0.6352883          0 0.6616541           1              0
## 67    0.9089932          0 0.6616541           0              0
## 68    0.8834282          0 0.6616541           0              0
## 69    0.7074154          0 0.6616541           0              0
## 70    0.8832441          0 0.6616541           0              0
## 71    0.9995702          1 1.0000000           1              1
## 72    0.8214399          0 0.6616541           0              0
## 73    0.7481269          0 0.6616541           1              0
## 74    0.9770322          1 1.0000000           0              1
## 75    0.5162810          0 0.6616541           0              0
## 76    0.8498646          0 0.6616541           0              0
## 77    0.7032368          0 0.6616541           0              0
## 78    0.8746534          0 0.6616541           1              0
## 79    0.9983300          0 0.6616541           0              0
## 80    0.8746534          0 0.6616541           1              0
## 81    0.8746534          0 0.6616541           0              0
## 82    0.6610544          1 1.0000000           1              1
## 83    0.6091501          1 1.0000000           0              0
## 84    0.5713303          0 0.6616541           0              0
## 85    0.8746534          0 0.6616541           0              0
## 86    0.5786507          0 0.6616541           1              0
## 87    0.8746534          0 0.6616541           0              0
## 88    0.5827670          0 0.6616541           0              0
## 89    0.9021534          0 0.6616541           0              0
## 90    0.5226191          0 0.6616541           0              0
## 91    0.5859559          0 0.6616541           1              0
## 92    0.8746534          0 0.6616541           0              0
## 93    0.7903308          0 0.6616541           1              0
## 94    0.5182800          0 0.6616541           1              1
## 95    0.5841865          0 0.6616541           1              0
## 96    0.6364076          0 0.6616541           0              0
## 97    0.8746534          0 0.6616541           1              0
## 98    0.8746534          0 0.6616541           0              0
## 99    0.8746534          0 0.6616541           0              0
## 100   0.6542612          0 0.6616541           1              1
## 101   0.9019525          0 0.6616541           0              0
## 102   0.9706423          0 0.6616541           1              1
## 103   0.5619032          0 0.6616541           1              1
## 104   0.5616499          1 1.0000000           0              1
## 105   0.5606976          0 0.6616541           1              0
## 106   0.6709817          0 0.6616541           1              1
## 107   0.6667234          0 0.6616541           1              0
## 108   0.7156138          0 0.6616541           0              0
## 109   0.7076696          0 0.6616541           0              0
## 110   0.8312640          0 0.6616541           1              0
## 111   0.8595656          0 0.6616541           0              0
## 112   0.8199301          1 1.0000000           1              1
## 113   0.8874999          0 0.6616541           0              0
## 114   0.8568738          0 0.6616541           0              0
## 115   0.5106395          0 0.6616541           1              0
## 116   0.7257786          0 0.6616541           1              0
## 117   0.8327206          0 0.6616541           0              0
## 118   0.6217435          0 0.6616541           1              0
## 119   0.9819054          1 1.0000000           1              1
## 120   0.7668394          0 0.6616541           1              0
## 121   0.7550201          0 0.6616541           0              0
## 122   0.8746534          0 0.6616541           1              0
## 123   0.8746534          0 0.6616541           0              0
## 124   0.8601919          0 0.6616541           0              0
## 125   0.6315163          1 1.0000000           1              0
## 126   0.8746534          0 0.6616541           0              0
## 127   0.8746534          0 0.6616541           0              0
## 128   0.8952274          0 0.6616541           1              0
## 129   0.9068703          0 0.6616541           0              0
## 130   0.8986318          0 0.6616541           0              0
## 131   0.8746534          0 0.6616541           1              0
## 132   0.8904833          0 0.6616541           0              0
## 133   0.7339450          0 0.6616541           1              0
## 134   0.9369540          0 0.6616541           1              1
## 135   0.6844138          0 0.6616541           0              0
## 136   0.8384454          0 0.6616541           0              0
## 137   0.8658935          0 0.6616541           0              0
## 138   0.8769053          0 0.6616541           0              1
## 139   0.6032253          0 0.6616541           1              0
## 140   0.6921802          0 0.6616541           0              0
## 141   0.8746534          0 0.6616541           0              0
## 142   0.8746534          0 0.6616541           1              0
## 143   0.7434196          0 0.6616541           0              0
## 144   0.8562146          0 0.6616541           0              0
## 145   0.7724615          0 0.6616541           0              0
## 146   0.8864544          0 0.6616541           1              0
## 147   0.7059566          0 0.6616541           0              0
## 148   0.8594154          0 0.6616541           0              0
## 149   0.6600620          0 0.6616541           0              0
## 150   0.7747498          0 0.6616541           1              0
## 151   0.8746534          0 0.6616541           1              0
## 152   0.9131066          0 0.6616541           1              0
## 153   0.6645777          1 1.0000000           0              0
## 154   0.8969748          0 0.6616541           0              0
## 155   0.8556519          0 0.6616541           1              0
## 156   0.9166776          0 0.6616541           0              0
## 157   0.5922849          0 0.6616541           1              1
## 158   0.5038534          0 0.6616541           1              0
## 159   0.6181674          0 0.6616541           0              0
## 160   0.6665485          0 0.6616541           1              0
## 161   0.8746534          0 0.6616541           0              0
## 162   0.7137833          0 0.6616541           0              0
## 163   0.9443730          0 0.6616541           0              0
## 164   0.8403756          0 0.6616541           0              0
## 165   0.8746468          1 1.0000000           1              1
## 166   0.8281456          0 0.6616541           1              0
## 167   0.8746534          0 0.6616541           0              0
## 168   0.7565338          0 0.6616541           0              0
## 169   0.8953264          0 0.6616541           0              0
## 170   0.6682192          0 0.6616541           1              0
## 171   0.8900256          1 1.0000000           1              1
## 172   0.6173225          0 0.6616541           1              0
## 173   0.8283956          0 0.6616541           1              0
## 174   0.8986282          0 0.6616541           0              0
## 175   0.5171646          0 0.6616541           1              0
## 176   0.6286292          1 1.0000000           0              1
## 177   0.9104825          0 0.6616541           0              0
## 178   0.9464180          0 0.6616541           1              0
## 179   0.8885275          0 0.6616541           1              0
## 180   0.8207597          0 0.6616541           1              1
## 181   0.6099646          0 0.6616541           1              0
## 182   0.7626637          0 0.6616541           0              0
## 183   0.6563896          0 0.6616541           1              0
## 184   0.8409868          0 0.6616541           1              1
## 185   0.5293677          0 0.6616541           0              0
## 186   0.9162461          0 0.6616541           0              0
## 187   0.8746534          0 0.6616541           1              0
## 188   0.8910004          0 0.6616541           0              0
## 189   0.8273065          0 0.6616541           0              0
## 190   0.9281266          0 0.6616541           1              1
## 191   0.5024032          0 0.6616541           0              0
## 192   0.5553232          0 0.6616541           1              1
## 193   0.8746534          0 0.6616541           0              0
## 194   0.8749611          0 0.6616541           1              0
## 195   0.5410156          0 0.6616541           1              1
## 196   0.8014755          0 0.6616541           1              1
## 197   0.6256132          0 0.6616541           1              0
## 198   0.9097767          0 0.6616541           0              0
## 199   0.6772744          1 1.0000000           0              0
## 200   0.9271029          0 0.6616541           0              0
## 201   0.8765775          0 0.6616541           1              0
## 202   0.8847603          0 0.6616541           0              0
## 203   0.5699222          0 0.6616541           0              0
## 204   0.8222035          0 0.6616541           0              0
## 205   0.8746534          0 0.6616541           1              0
## 206   0.8746534          0 0.6616541           1              0
## 207   0.8500820          0 0.6616541           1              1
## 208   0.6184018          0 0.6616541           1              0
## 209   0.7305457          0 0.6616541           1              0
## 210   0.5376464          0 0.6616541           0              0
## 211   0.9535965          0 0.6616541           1              0
## 212   0.6531240          0 0.6616541           0              0
## 213   0.7373271          0 0.6616541           1              0
## 214   0.6491119          0 0.6616541           0              0
## 215   0.9286990          0 0.6616541           0              0
## 216   0.8746534          0 0.6616541           0              0
## 217   0.7704956          0 0.6616541           1              0
## 218   0.7749730          0 0.6616541           0              0
## 219   0.7226916          0 0.6616541           1              0
## 220   0.8746534          0 0.6616541           0              0
## 221   0.7775379          0 0.6616541           1              1
## 222   0.5794264          0 0.6616541           1              0
## 223   0.6628768          0 0.6616541           1              0
## 224   0.6536263          0 0.6616541           1              0
## 225   0.9037813          0 0.6616541           0              0
## 226   0.8746534          0 0.6616541           1              0
## 227   0.9114148          0 0.6616541           0              0
## 228   0.6858137          0 0.6616541           0              1
## 229   0.7137479          0 0.6616541           1              1
## 230   0.5809665          0 0.6616541           1              0
## 231   0.8387342          1 1.0000000           0              0
## 232   0.7267735          0 0.6616541           0              0
## 233   0.7618806          0 0.6616541           0              0
## 234   0.8129286          0 0.6616541           0              0
## 235   0.9337398          0 0.6616541           0              0
## 236   0.8040740          0 0.6616541           1              0
## 237   0.8176373          0 0.6616541           1              0
## 238   0.6446767          0 0.6616541           0              0
## 239   0.8746534          0 0.6616541           1              0
## 240   0.8746534          0 0.6616541           0              0
## 241   0.6408152          0 0.6616541           0              1
## 242   0.8746534          0 0.6616541           0              0
## 243   0.9689157          0 0.6616541           0              0
## 244   0.6692096          0 0.6616541           1              0
## 245   0.8800186          0 0.6616541           0              0
## 246   0.9967607          1 1.0000000           1              1
## 247   0.6000537          0 0.6616541           1              0
## 248   0.9940200          0 0.6616541           1              1
## 249   0.8264154          0 0.6616541           0              0
## 250   0.9979586          1 1.0000000           1              1
## 251   0.8746534          0 0.6616541           0              0
## 252   0.6779741          0 0.6616541           0              0
## 253   0.9199594          0 0.6616541           0              0
## 254   0.6814197          0 0.6616541           1              0
## 255   0.5887814          0 0.6616541           1              0
## 256   0.9931163          0 0.6616541           1              0
## 257   0.8004220          0 0.6616541           0              0
## 258   0.8062138          0 0.6616541           1              0
## 259   0.9982997          0 0.6616541           1              1
## 260   0.7225246          0 0.6616541           1              0
## 261   0.9200165          0 0.6616541           1              0
## 262   0.8746534          0 0.6616541           0              0
## 263   0.9619019          0 0.6616541           0              1
## 264   0.7348503          0 0.6616541           0              0
## 265   0.6411702          0 0.6616541           1              0
## 266   0.8746534          0 0.6616541           1              0
## 267   0.6504270          0 0.6616541           1              1
## 268   0.9273858          1 1.0000000           0              1
## 269   0.9912170          1 1.0000000           1              1
## 270   0.7350740          0 0.6616541           1              0
## 271   0.8746534          0 0.6616541           1              0
## 272   0.6740189          0 0.6616541           1              1
## 273   0.9244426          0 0.6616541           0              0
## 274   0.6117752          0 0.6616541           0              0
## 275   0.5125766          0 0.6616541           1              0
## 276   0.8078549          0 0.6616541           0              0
## 277   0.8746534          0 0.6616541           0              0
## 278   0.9993863          1 1.0000000           1              1
## 279   0.9873018          1 1.0000000           0              0
## 280   0.5832589          0 0.6616541           0              0
## 281   0.8249900          0 0.6616541           0              0
## 282   0.9808298          0 0.6616541           1              1
## 283   0.7999144          0 0.6616541           1              1
## 284   0.6644174          0 0.6616541           0              0
## 285   0.6062395          0 0.6616541           0              0
## 286   0.6346054          0 0.6616541           1              0
## 287   0.7637505          0 0.6616541           1              1
## 288   0.8381115          0 0.6616541           0              0
## 289   0.7789835          0 0.6616541           1              0
## 290   0.8746534          0 0.6616541           1              0
## 291   0.5536288          0 0.6616541           0              1
## 292   0.9591116          0 0.6616541           1              1
## 293   0.8262910          0 0.6616541           1              1
## 294   0.6851140          0 0.6616541           0              0
## 295   0.6543690          0 0.6616541           0              0
## 296   0.5927256          0 0.6616541           0              0
## 297   0.8746534          0 0.6616541           1              0
## 298   0.9601066          0 0.6616541           1              1
## 299   0.8091701          0 0.6616541           0              0
## 300   0.7413772          0 0.6616541           0              0
## 301   0.6751602          0 0.6616541           1              0
## 302   0.8746534          0 0.6616541           1              0
## 303   0.9029062          0 0.6616541           1              0
## 304   0.7288891          0 0.6616541           1              0
## 305   0.6345383          1 1.0000000           0              1
## 306   0.9208078          0 0.6616541           0              0
## 307   0.5722034          0 0.6616541           1              0
## 308   0.9030216          0 0.6616541           1              0
## 309   0.8637540          0 0.6616541           0              0
## 310   0.5019201          0 0.6616541           0              0
## 311   0.6699482          0 0.6616541           1              0
## 312   0.9919655          1 1.0000000           1              1
## 313   0.8373843          0 0.6616541           1              0
## 314   0.9178607          1 1.0000000           0              0
## 315   0.8805268          1 1.0000000           1              1
## 316   0.8746534          0 0.6616541           1              0
## 317   0.7853267          0 0.6616541           0              0
## 318   0.8330249          0 0.6616541           1              1
## 319   0.6889858          0 0.6616541           1              0
## 320   0.6609509          0 0.6616541           0              0
## 321   0.6329276          0 0.6616541           0              0
## 322   0.8746534          0 0.6616541           0              0
## 323   0.8746534          0 0.6616541           0              0
## 324   0.6777173          0 0.6616541           1              0
## 325   0.8870735          0 0.6616541           1              0
## 326   0.6844521          0 0.6616541           1              0
## 327   0.8151715          0 0.6616541           1              0
## 328   0.7168045          0 0.6616541           1              0
## 329   0.9917773          1 1.0000000           1              1
## 330   0.9171512          0 0.6616541           0              0
## 331   0.8248358          0 0.6616541           1              1
## 332   0.8308865          0 0.6616541           1              0
## 333   0.6372078          1 1.0000000           0              1
## 334   0.6881883          0 0.6616541           1              1
## 335   0.9176487          0 0.6616541           0              0
## 336   0.8321578          0 0.6616541           1              0
## 337   0.7311433          0 0.6616541           0              0
## 338   0.8416069          0 0.6616541           0              0
## 339   0.5816619          0 0.6616541           1              1
## 340   0.8746534          0 0.6616541           1              0
## 341   0.6930336          0 0.6616541           0              1
## 342   0.7817926          0 0.6616541           1              0
## 343   0.8746534          0 0.6616541           0              0
## 344   0.8817977          0 0.6616541           0              0
## 345   0.6281196          0 0.6616541           1              0
## 346   0.9904422          1 1.0000000           1              1
## 347   0.6629598          0 0.6616541           0              0
## 348   0.7940204          0 0.6616541           1              0
## 349   0.8925611          0 0.6616541           0              0
## 350   0.9214585          0 0.6616541           0              0
## 351   0.8082103          0 0.6616541           0              0
## 352   0.9994504          0 0.6616541           1              1
## 353   0.8974176          0 0.6616541           1              0
## 354   0.7840688          0 0.6616541           0              0
## 355   0.8746534          0 0.6616541           1              0
## 356   0.8746534          0 0.6616541           1              0
## 357   0.7632438          0 0.6616541           0              0
## 358   0.7417794          0 0.6616541           0              1
## 359   0.5691815          0 0.6616541           0              0
## 360   0.8746534          0 0.6616541           1              0
## 361   0.8599967          1 1.0000000           1              1
## 362   0.7488982          0 0.6616541           1              1
## 363   0.7639572          0 0.6616541           1              1
## 364   0.6777089          0 0.6616541           1              1
## 365   0.8754256          0 0.6616541           0              0
## 366   0.9222848          0 0.6616541           1              0
## 367   0.6697521          0 0.6616541           1              1
## 368   0.9484218          0 0.6616541           0              0
## 369   0.6887910          0 0.6616541           0              0
## 370   0.6279341          1 1.0000000           0              0
## 371   0.8746265          0 0.6616541           0              0
## 372   0.8746534          0 0.6616541           0              0
## 373   0.6400280          0 0.6616541           0              0
## 374   0.5251890          0 0.6616541           1              0
## 375   0.8746534          0 0.6616541           0              0
## 376   0.6334517          0 0.6616541           0              0
## 377   0.7636947          0 0.6616541           0              0
## 378   0.8577017          0 0.6616541           1              0
## 379   0.9644245          0 0.6616541           0              0
## 380   0.7237385          0 0.6616541           1              0
## 381   0.8618935          0 0.6616541           0              0
## 382   0.7836579          0 0.6616541           1              1
## 383   0.9728826          0 0.6616541           1              1
## 384   0.9425455          0 0.6616541           0              0
## 385   0.5030809          1 1.0000000           1              1
## 386   0.9282131          1 1.0000000           1              1
## 387   0.7008030          0 0.6616541           0              0
## 388   0.9067455          0 0.6616541           0              0
## 389   0.8183036          0 0.6616541           1              0
## 390   0.8579085          0 0.6616541           0              0
## 391   0.6819387          0 0.6616541           0              1
## 392   0.7658199          0 0.6616541           0              0
## 393   0.8746534          0 0.6616541           1              0
## 394   0.8881393          0 0.6616541           0              0
## 395   0.8894261          0 0.6616541           0              0
## 396   0.8746534          0 0.6616541           0              0
## 397   0.9375659          0 0.6616541           0              0
## 398   0.9023910          0 0.6616541           0              0
## 399   0.8746534          0 0.6616541           1              0
## 400   0.9214765          0 0.6616541           1              1
## 401   0.7005511          0 0.6616541           0              0
## 402   0.8746534          0 0.6616541           0              0
## 403   0.7470252          0 0.6616541           0              0
## 404   0.7710190          0 0.6616541           1              0
## 405   0.8996229          1 1.0000000           1              1
## 406   0.6804060          0 0.6616541           1              1
## 407   0.9371866          1 1.0000000           0              1
## 408   0.9301479          0 0.6616541           1              1
## 409   0.7996398          0 0.6616541           0              0
## 410   0.5743247          0 0.6616541           0              0
## 411   0.6372078          1 1.0000000           1              1
## 412   0.8746534          0 0.6616541           1              0
## 413   0.7699589          0 0.6616541           1              1
## 414   0.8001484          0 0.6616541           0              0
## 415   0.9073975          0 0.6616541           0              0
## 416   0.8746534          0 0.6616541           0              0
## 417   0.8887497          0 0.6616541           1              0
## 418   0.6116572          1 1.0000000           0              1
## 419   0.8640359          1 1.0000000           1              1
## 420   0.6981186          0 0.6616541           1              1
## 421   0.8806481          0 0.6616541           0              0
## 422   0.8746496          0 0.6616541           0              0
## 423   0.9123808          0 0.6616541           1              1
## 424   0.8746534          0 0.6616541           1              0
## 425   0.8146468          0 0.6616541           0              0
## 426   0.9201436          0 0.6616541           1              0
## 427   0.9104727          0 0.6616541           0              0
## 428   0.8746534          0 0.6616541           0              0
## 429   0.7007718          0 0.6616541           1              0
## 430   0.9449447          0 0.6616541           0              0
## 431   0.8185533          0 0.6616541           0              0
## 432   0.6710381          0 0.6616541           1              0
## 433   0.7552511          0 0.6616541           1              1
## 434   0.9278294          0 0.6616541           0              0
## 435   0.9001330          1 1.0000000           1              1
## 436   0.8746534          0 0.6616541           1              0
## 437   0.8360667          0 0.6616541           0              0
## 438   0.6116088          0 0.6616541           1              0
## 439   0.7740231          0 0.6616541           0              0
## 440   0.9971169          1 1.0000000           1              1
## 441   0.5545698          1 1.0000000           1              1
## 442   0.5498202          0 0.6616541           0              0
## 443   0.7276496          0 0.6616541           1              0
## 444   0.9770802          0 0.6616541           1              0
## 445   0.8548578          0 0.6616541           0              0
## 446   0.8414296          0 0.6616541           0              0
## 447   0.5707888          0 0.6616541           1              0
## 448   0.5907567          0 0.6616541           0              0
## 449   0.8746534          0 0.6616541           0              0
## 450   0.9190581          1 1.0000000           1              1
## 451   0.9155459          0 0.6616541           0              0
## 452   0.8853009          0 0.6616541           0              0
## 453   0.7696008          0 0.6616541           1              0
## 454   0.7228904          0 0.6616541           1              1
## 455   0.9448578          1 1.0000000           0              0
## 456   0.8808083          0 0.6616541           1              0
## 457   0.7826945          0 0.6616541           1              1
## 458   0.8746534          0 0.6616541           0              0
## 459   0.9801429          1 1.0000000           1              1
## 460   0.8924172          0 0.6616541           0              0
## 461   0.8746534          0 0.6616541           1              0
## 462   0.8746534          0 0.6616541           1              0
## 463   0.6770943          1 1.0000000           0              1
## 464   0.8975955          0 0.6616541           1              0
## 465   0.6621775          0 0.6616541           1              1
## 466   0.8594049          0 0.6616541           1              1
## 467   0.8494618          0 0.6616541           0              0
## 468   0.7852747          0 0.6616541           1              0
## 469   0.8843624          0 0.6616541           0              0
## 470   0.8746534          0 0.6616541           0              0
## 471   0.8746534          0 0.6616541           0              0
## 472   0.7880006          0 0.6616541           0              1
## 473   0.8761443          0 0.6616541           0              0
## 474   0.8524260          0 0.6616541           0              0
## 475   0.8847374          0 0.6616541           1              0
## 476   0.9418128          0 0.6616541           0              0
## 477   0.9218629          0 0.6616541           0              0
## 478   0.8716173          0 0.6616541           0              0
## 479   0.9429103          0 0.6616541           0              0
## 480   0.6808381          0 0.6616541           0              0
## 481   0.8661160          0 0.6616541           1              1
## 482   0.8672682          0 0.6616541           0              0
## 483   0.5456351          0 0.6616541           1              0
## 484   0.7057263          0 0.6616541           0              1
## 485   0.9029786          1 1.0000000           0              0
## 486   0.5994841          0 0.6616541           1              0
## 487   0.9048033          0 0.6616541           1              1
## 488   0.8632586          0 0.6616541           0              0
## 489   0.5560813          0 0.6616541           0              0
## 490   0.8322249          0 0.6616541           0              0
## 491   0.8788313          0 0.6616541           0              0
## 492   0.8876063          0 0.6616541           1              0
## 493   0.6241686          1 1.0000000           1              1
## 494   0.8148390          1 1.0000000           1              0
## 495   0.6538010          0 0.6616541           0              0
## 496   0.9102920          0 0.6616541           0              0
## 497   0.9044113          0 0.6616541           0              0
## 498   0.7591783          0 0.6616541           1              0
## 499   0.7433828          0 0.6616541           0              0
## 500   0.8746534          0 0.6616541           0              0
##     CONSENSUS_AGREE CONSENSUS_INCORRECT PROBABILITY_CODE
## 1                 2                   0                1
## 2                 4                   0                0
## 3                 4                   0                1
## 4                 2                   0                0
## 5                 2                   1                0
## 6                 3                   1                0
## 7                 2                   1                0
## 8                 4                   0                0
## 9                 4                   0                0
## 10                4                   0                0
## 11                4                   0                0
## 12                4                   1                1
## 13                3                   0                0
## 14                4                   0                0
## 15                4                   0                0
## 16                4                   0                0
## 17                4                   1                0
## 18                4                   0                0
## 19                3                   1                0
## 20                2                   1                0
## 21                2                   1                1
## 22                3                   0                0
## 23                4                   0                1
## 24                4                   1                0
## 25                4                   0                1
## 26                4                   0                0
## 27                4                   0                0
## 28                3                   0                1
## 29                4                   0                0
## 30                2                   0                0
## 31                3                   0                0
## 32                3                   0                1
## 33                4                   0                1
## 34                2                   1                1
## 35                3                   0                1
## 36                4                   0                0
## 37                3                   1                0
## 38                2                   1                1
## 39                2                   1                1
## 40                4                   0                0
## 41                2                   1                1
## 42                4                   0                0
## 43                3                   1                1
## 44                3                   0                0
## 45                4                   0                0
## 46                3                   1                0
## 47                4                   0                0
## 48                3                   0                1
## 49                3                   1                1
## 50                4                   0                0
## 51                4                   1                0
## 52                2                   0                1
## 53                4                   1                0
## 54                4                   1                0
## 55                2                   1                1
## 56                2                   1                0
## 57                4                   0                0
## 58                3                   0                0
## 59                3                   1                1
## 60                2                   0                0
## 61                4                   0                1
## 62                4                   0                1
## 63                3                   1                0
## 64                3                   0                1
## 65                4                   0                1
## 66                3                   1                0
## 67                3                   0                0
## 68                4                   0                0
## 69                4                   0                0
## 70                4                   0                0
## 71                4                   0                1
## 72                2                   0                1
## 73                2                   1                0
## 74                4                   1                1
## 75                4                   0                0
## 76                4                   0                0
## 77                2                   0                0
## 78                2                   1                0
## 79                2                   0                1
## 80                2                   1                1
## 81                4                   0                0
## 82                4                   0                1
## 83                2                   0                1
## 84                2                   0                1
## 85                4                   0                0
## 86                4                   1                0
## 87                4                   0                0
## 88                2                   0                0
## 89                4                   0                0
## 90                2                   0                1
## 91                4                   1                0
## 92                4                   0                0
## 93                2                   1                1
## 94                3                   0                1
## 95                2                   1                1
## 96                3                   0                0
## 97                3                   1                0
## 98                3                   0                0
## 99                2                   0                1
## 100               3                   0                0
## 101               4                   0                0
## 102               3                   0                1
## 103               3                   0                1
## 104               3                   1                1
## 105               2                   1                1
## 106               3                   0                1
## 107               3                   1                0
## 108               4                   0                0
## 109               4                   0                0
## 110               2                   1                1
## 111               4                   0                0
## 112               4                   0                1
## 113               4                   0                0
## 114               4                   0                0
## 115               3                   1                1
## 116               2                   1                1
## 117               4                   0                0
## 118               4                   1                0
## 119               4                   0                1
## 120               3                   1                0
## 121               4                   0                0
## 122               2                   1                1
## 123               4                   0                0
## 124               4                   0                0
## 125               2                   1                1
## 126               4                   0                0
## 127               4                   0                0
## 128               4                   1                0
## 129               4                   0                0
## 130               4                   0                0
## 131               2                   1                1
## 132               4                   0                0
## 133               2                   1                1
## 134               3                   0                1
## 135               2                   0                1
## 136               2                   0                0
## 137               3                   0                0
## 138               3                   1                1
## 139               2                   1                1
## 140               4                   0                0
## 141               4                   0                0
## 142               2                   1                1
## 143               4                   0                0
## 144               4                   0                0
## 145               4                   0                0
## 146               4                   1                0
## 147               4                   0                0
## 148               4                   0                0
## 149               3                   0                0
## 150               3                   1                1
## 151               2                   1                1
## 152               2                   1                1
## 153               2                   0                1
## 154               4                   0                0
## 155               4                   1                0
## 156               4                   0                0
## 157               3                   0                1
## 158               2                   1                1
## 159               4                   0                0
## 160               2                   1                1
## 161               4                   0                0
## 162               3                   0                0
## 163               4                   0                0
## 164               4                   0                0
## 165               3                   0                1
## 166               2                   1                1
## 167               4                   0                0
## 168               4                   0                0
## 169               4                   0                0
## 170               3                   1                0
## 171               4                   0                1
## 172               4                   1                0
## 173               2                   1                1
## 174               4                   0                0
## 175               2                   1                1
## 176               4                   1                1
## 177               4                   0                0
## 178               2                   1                0
## 179               4                   1                0
## 180               3                   0                1
## 181               2                   1                1
## 182               4                   0                0
## 183               3                   1                0
## 184               3                   0                1
## 185               4                   0                0
## 186               4                   0                0
## 187               2                   1                1
## 188               4                   0                0
## 189               3                   0                0
## 190               3                   0                1
## 191               4                   0                0
## 192               3                   0                1
## 193               4                   0                0
## 194               4                   1                0
## 195               3                   0                1
## 196               3                   0                1
## 197               2                   1                1
## 198               4                   0                0
## 199               3                   0                1
## 200               2                   0                1
## 201               4                   1                0
## 202               2                   0                0
## 203               2                   0                0
## 204               4                   0                0
## 205               2                   1                0
## 206               2                   1                1
## 207               3                   0                1
## 208               2                   1                1
## 209               2                   1                1
## 210               4                   0                0
## 211               2                   1                1
## 212               4                   0                0
## 213               2                   1                1
## 214               4                   0                0
## 215               4                   0                0
## 216               4                   0                0
## 217               2                   1                1
## 218               4                   0                0
## 219               3                   1                0
## 220               4                   0                0
## 221               3                   0                1
## 222               3                   1                0
## 223               2                   1                1
## 224               2                   1                1
## 225               3                   0                0
## 226               2                   1                1
## 227               4                   0                0
## 228               3                   1                1
## 229               3                   0                1
## 230               2                   1                1
## 231               2                   0                1
## 232               4                   0                0
## 233               4                   0                0
## 234               3                   0                0
## 235               4                   0                0
## 236               2                   1                1
## 237               2                   1                0
## 238               4                   0                0
## 239               2                   1                1
## 240               3                   0                1
## 241               3                   1                1
## 242               4                   0                0
## 243               3                   0                0
## 244               2                   1                1
## 245               4                   0                0
## 246               4                   0                1
## 247               4                   1                0
## 248               3                   0                1
## 249               4                   0                0
## 250               4                   0                1
## 251               4                   0                0
## 252               2                   0                1
## 253               4                   0                0
## 254               2                   1                1
## 255               4                   1                0
## 256               2                   1                1
## 257               4                   0                0
## 258               2                   1                1
## 259               3                   0                1
## 260               2                   1                1
## 261               4                   1                0
## 262               4                   0                0
## 263               3                   1                1
## 264               4                   0                0
## 265               2                   1                1
## 266               2                   1                1
## 267               3                   0                1
## 268               3                   1                1
## 269               4                   0                1
## 270               2                   1                1
## 271               4                   1                0
## 272               3                   0                1
## 273               4                   0                0
## 274               2                   0                0
## 275               2                   1                1
## 276               4                   0                0
## 277               4                   0                0
## 278               4                   0                1
## 279               2                   0                1
## 280               4                   0                0
## 281               4                   0                0
## 282               3                   0                1
## 283               3                   0                1
## 284               4                   0                0
## 285               4                   0                0
## 286               2                   1                1
## 287               3                   0                1
## 288               3                   0                0
## 289               2                   1                1
## 290               2                   1                1
## 291               3                   1                0
## 292               3                   0                1
## 293               3                   0                1
## 294               4                   0                0
## 295               4                   0                0
## 296               4                   0                0
## 297               3                   1                0
## 298               3                   0                1
## 299               2                   0                0
## 300               4                   0                0
## 301               4                   1                0
## 302               2                   1                1
## 303               2                   1                1
## 304               2                   1                1
## 305               3                   1                1
## 306               4                   0                0
## 307               2                   1                1
## 308               2                   1                0
## 309               4                   0                0
## 310               3                   0                0
## 311               4                   1                0
## 312               4                   0                1
## 313               4                   1                0
## 314               2                   0                1
## 315               4                   0                1
## 316               4                   1                0
## 317               3                   0                1
## 318               3                   0                1
## 319               2                   1                1
## 320               4                   0                0
## 321               4                   0                0
## 322               4                   0                0
## 323               3                   0                0
## 324               2                   1                0
## 325               2                   1                1
## 326               2                   1                1
## 327               2                   1                1
## 328               2                   1                1
## 329               4                   0                1
## 330               4                   0                0
## 331               3                   0                1
## 332               2                   1                1
## 333               4                   1                1
## 334               3                   0                1
## 335               4                   0                0
## 336               2                   1                0
## 337               3                   0                0
## 338               4                   0                0
## 339               3                   0                1
## 340               2                   1                1
## 341               3                   1                1
## 342               2                   1                0
## 343               4                   0                0
## 344               3                   0                0
## 345               2                   1                1
## 346               4                   0                1
## 347               4                   0                0
## 348               2                   1                1
## 349               4                   0                0
## 350               3                   0                0
## 351               4                   0                0
## 352               3                   0                1
## 353               3                   1                0
## 354               4                   0                0
## 355               2                   1                1
## 356               2                   1                0
## 357               4                   0                0
## 358               3                   1                1
## 359               4                   0                0
## 360               4                   1                0
## 361               4                   0                1
## 362               3                   0                1
## 363               3                   0                1
## 364               3                   0                1
## 365               4                   0                0
## 366               2                   1                0
## 367               3                   0                1
## 368               2                   0                1
## 369               3                   0                0
## 370               2                   0                1
## 371               4                   0                0
## 372               4                   0                0
## 373               4                   0                0
## 374               3                   1                0
## 375               2                   0                1
## 376               4                   0                0
## 377               4                   0                0
## 378               3                   1                0
## 379               3                   0                1
## 380               2                   1                1
## 381               3                   0                0
## 382               3                   0                1
## 383               3                   0                1
## 384               4                   0                0
## 385               3                   0                1
## 386               4                   0                1
## 387               4                   0                0
## 388               4                   0                0
## 389               2                   1                1
## 390               4                   0                0
## 391               3                   1                1
## 392               4                   0                0
## 393               2                   1                1
## 394               4                   0                0
## 395               4                   0                0
## 396               4                   0                0
## 397               2                   0                1
## 398               4                   0                0
## 399               2                   1                1
## 400               3                   0                1
## 401               4                   0                0
## 402               4                   0                0
## 403               4                   0                0
## 404               2                   1                1
## 405               4                   0                1
## 406               3                   0                1
## 407               4                   1                1
## 408               3                   0                1
## 409               3                   0                0
## 410               4                   0                0
## 411               4                   0                1
## 412               3                   1                0
## 413               3                   0                1
## 414               4                   0                0
## 415               2                   0                1
## 416               4                   0                0
## 417               3                   1                0
## 418               3                   1                1
## 419               3                   0                1
## 420               3                   0                1
## 421               4                   0                0
## 422               4                   0                0
## 423               3                   0                1
## 424               4                   1                0
## 425               4                   0                0
## 426               4                   1                0
## 427               4                   0                0
## 428               4                   0                0
## 429               3                   1                1
## 430               4                   0                0
## 431               4                   0                0
## 432               4                   1                0
## 433               3                   0                1
## 434               4                   0                0
## 435               4                   0                1
## 436               2                   1                0
## 437               4                   0                0
## 438               2                   1                1
## 439               2                   0                1
## 440               4                   0                1
## 441               4                   0                1
## 442               4                   0                0
## 443               2                   1                1
## 444               2                   1                1
## 445               4                   0                0
## 446               4                   0                0
## 447               3                   1                0
## 448               3                   0                0
## 449               2                   0                0
## 450               4                   0                1
## 451               4                   0                0
## 452               3                   0                1
## 453               3                   1                0
## 454               3                   0                1
## 455               2                   0                1
## 456               3                   1                0
## 457               3                   0                1
## 458               4                   0                0
## 459               4                   0                1
## 460               4                   0                0
## 461               3                   1                0
## 462               2                   1                0
## 463               3                   1                1
## 464               2                   1                1
## 465               3                   0                1
## 466               3                   0                1
## 467               2                   0                1
## 468               4                   1                0
## 469               4                   0                0
## 470               4                   0                0
## 471               3                   0                0
## 472               3                   1                1
## 473               4                   0                0
## 474               2                   0                1
## 475               3                   1                0
## 476               4                   0                0
## 477               4                   0                0
## 478               4                   0                0
## 479               4                   0                0
## 480               4                   0                0
## 481               3                   0                1
## 482               2                   0                1
## 483               2                   1                0
## 484               3                   1                1
## 485               2                   0                1
## 486               2                   1                1
## 487               3                   0                1
## 488               4                   0                0
## 489               4                   0                0
## 490               4                   0                0
## 491               4                   0                0
## 492               2                   1                1
## 493               4                   0                1
## 494               2                   1                1
## 495               2                   0                1
## 496               4                   0                0
## 497               4                   0                0
## 498               2                   1                1
## 499               4                   0                0
## 500               4                   0                0
##     PROBABILITY_INCORRECT
## 1                       1
## 2                       0
## 3                       0
## 4                       0
## 5                       1
## 6                       1
## 7                       1
## 8                       0
## 9                       0
## 10                      0
## 11                      0
## 12                      1
## 13                      0
## 14                      0
## 15                      0
## 16                      0
## 17                      1
## 18                      0
## 19                      1
## 20                      1
## 21                      0
## 22                      0
## 23                      0
## 24                      1
## 25                      0
## 26                      0
## 27                      0
## 28                      1
## 29                      0
## 30                      0
## 31                      0
## 32                      0
## 33                      0
## 34                      0
## 35                      0
## 36                      0
## 37                      1
## 38                      0
## 39                      0
## 40                      0
## 41                      0
## 42                      0
## 43                      0
## 44                      0
## 45                      0
## 46                      1
## 47                      0
## 48                      0
## 49                      1
## 50                      0
## 51                      1
## 52                      1
## 53                      1
## 54                      1
## 55                      0
## 56                      1
## 57                      0
## 58                      0
## 59                      0
## 60                      0
## 61                      0
## 62                      0
## 63                      1
## 64                      0
## 65                      0
## 66                      1
## 67                      0
## 68                      0
## 69                      0
## 70                      0
## 71                      0
## 72                      1
## 73                      1
## 74                      1
## 75                      0
## 76                      0
## 77                      0
## 78                      1
## 79                      1
## 80                      0
## 81                      0
## 82                      0
## 83                      1
## 84                      1
## 85                      0
## 86                      1
## 87                      0
## 88                      0
## 89                      0
## 90                      1
## 91                      1
## 92                      0
## 93                      0
## 94                      0
## 95                      0
## 96                      0
## 97                      1
## 98                      0
## 99                      1
## 100                     1
## 101                     0
## 102                     0
## 103                     0
## 104                     1
## 105                     0
## 106                     0
## 107                     1
## 108                     0
## 109                     0
## 110                     0
## 111                     0
## 112                     0
## 113                     0
## 114                     0
## 115                     0
## 116                     0
## 117                     0
## 118                     1
## 119                     0
## 120                     1
## 121                     0
## 122                     0
## 123                     0
## 124                     0
## 125                     0
## 126                     0
## 127                     0
## 128                     1
## 129                     0
## 130                     0
## 131                     0
## 132                     0
## 133                     0
## 134                     0
## 135                     1
## 136                     0
## 137                     0
## 138                     1
## 139                     0
## 140                     0
## 141                     0
## 142                     0
## 143                     0
## 144                     0
## 145                     0
## 146                     1
## 147                     0
## 148                     0
## 149                     0
## 150                     0
## 151                     0
## 152                     0
## 153                     1
## 154                     0
## 155                     1
## 156                     0
## 157                     0
## 158                     0
## 159                     0
## 160                     0
## 161                     0
## 162                     0
## 163                     0
## 164                     0
## 165                     0
## 166                     0
## 167                     0
## 168                     0
## 169                     0
## 170                     1
## 171                     0
## 172                     1
## 173                     0
## 174                     0
## 175                     0
## 176                     1
## 177                     0
## 178                     1
## 179                     1
## 180                     0
## 181                     0
## 182                     0
## 183                     1
## 184                     0
## 185                     0
## 186                     0
## 187                     0
## 188                     0
## 189                     0
## 190                     0
## 191                     0
## 192                     0
## 193                     0
## 194                     1
## 195                     0
## 196                     0
## 197                     0
## 198                     0
## 199                     1
## 200                     1
## 201                     1
## 202                     0
## 203                     0
## 204                     0
## 205                     1
## 206                     0
## 207                     0
## 208                     0
## 209                     0
## 210                     0
## 211                     0
## 212                     0
## 213                     0
## 214                     0
## 215                     0
## 216                     0
## 217                     0
## 218                     0
## 219                     1
## 220                     0
## 221                     0
## 222                     1
## 223                     0
## 224                     0
## 225                     0
## 226                     0
## 227                     0
## 228                     1
## 229                     0
## 230                     0
## 231                     1
## 232                     0
## 233                     0
## 234                     0
## 235                     0
## 236                     0
## 237                     1
## 238                     0
## 239                     0
## 240                     1
## 241                     1
## 242                     0
## 243                     0
## 244                     0
## 245                     0
## 246                     0
## 247                     1
## 248                     0
## 249                     0
## 250                     0
## 251                     0
## 252                     1
## 253                     0
## 254                     0
## 255                     1
## 256                     0
## 257                     0
## 258                     0
## 259                     0
## 260                     0
## 261                     1
## 262                     0
## 263                     1
## 264                     0
## 265                     0
## 266                     0
## 267                     0
## 268                     1
## 269                     0
## 270                     0
## 271                     1
## 272                     0
## 273                     0
## 274                     0
## 275                     0
## 276                     0
## 277                     0
## 278                     0
## 279                     1
## 280                     0
## 281                     0
## 282                     0
## 283                     0
## 284                     0
## 285                     0
## 286                     0
## 287                     0
## 288                     0
## 289                     0
## 290                     0
## 291                     0
## 292                     0
## 293                     0
## 294                     0
## 295                     0
## 296                     0
## 297                     1
## 298                     0
## 299                     0
## 300                     0
## 301                     1
## 302                     0
## 303                     0
## 304                     0
## 305                     1
## 306                     0
## 307                     0
## 308                     1
## 309                     0
## 310                     0
## 311                     1
## 312                     0
## 313                     1
## 314                     1
## 315                     0
## 316                     1
## 317                     1
## 318                     0
## 319                     0
## 320                     0
## 321                     0
## 322                     0
## 323                     0
## 324                     1
## 325                     0
## 326                     0
## 327                     0
## 328                     0
## 329                     0
## 330                     0
## 331                     0
## 332                     0
## 333                     1
## 334                     0
## 335                     0
## 336                     1
## 337                     0
## 338                     0
## 339                     0
## 340                     0
## 341                     1
## 342                     1
## 343                     0
## 344                     0
## 345                     0
## 346                     0
## 347                     0
## 348                     0
## 349                     0
## 350                     0
## 351                     0
## 352                     0
## 353                     1
## 354                     0
## 355                     0
## 356                     1
## 357                     0
## 358                     1
## 359                     0
## 360                     1
## 361                     0
## 362                     0
## 363                     0
## 364                     0
## 365                     0
## 366                     1
## 367                     0
## 368                     1
## 369                     0
## 370                     1
## 371                     0
## 372                     0
## 373                     0
## 374                     1
## 375                     1
## 376                     0
## 377                     0
## 378                     1
## 379                     1
## 380                     0
## 381                     0
## 382                     0
## 383                     0
## 384                     0
## 385                     0
## 386                     0
## 387                     0
## 388                     0
## 389                     0
## 390                     0
## 391                     1
## 392                     0
## 393                     0
## 394                     0
## 395                     0
## 396                     0
## 397                     1
## 398                     0
## 399                     0
## 400                     0
## 401                     0
## 402                     0
## 403                     0
## 404                     0
## 405                     0
## 406                     0
## 407                     1
## 408                     0
## 409                     0
## 410                     0
## 411                     0
## 412                     1
## 413                     0
## 414                     0
## 415                     1
## 416                     0
## 417                     1
## 418                     1
## 419                     0
## 420                     0
## 421                     0
## 422                     0
## 423                     0
## 424                     1
## 425                     0
## 426                     1
## 427                     0
## 428                     0
## 429                     0
## 430                     0
## 431                     0
## 432                     1
## 433                     0
## 434                     0
## 435                     0
## 436                     1
## 437                     0
## 438                     0
## 439                     1
## 440                     0
## 441                     0
## 442                     0
## 443                     0
## 444                     0
## 445                     0
## 446                     0
## 447                     1
## 448                     0
## 449                     0
## 450                     0
## 451                     0
## 452                     1
## 453                     1
## 454                     0
## 455                     1
## 456                     1
## 457                     0
## 458                     0
## 459                     0
## 460                     0
## 461                     1
## 462                     1
## 463                     1
## 464                     0
## 465                     0
## 466                     0
## 467                     1
## 468                     1
## 469                     0
## 470                     0
## 471                     0
## 472                     1
## 473                     0
## 474                     1
## 475                     1
## 476                     0
## 477                     0
## 478                     0
## 479                     0
## 480                     0
## 481                     0
## 482                     1
## 483                     1
## 484                     1
## 485                     1
## 486                     0
## 487                     0
## 488                     0
## 489                     0
## 490                     0
## 491                     0
## 492                     0
## 493                     0
## 494                     0
## 495                     1
## 496                     0
## 497                     0
## 498                     0
## 499                     0
## 500                     0
analytics@ensemble_summary # SUMMARY OF ENSEMBLE PRECISION/COVERAGE. USES THE n VARIABLE PASSED INTO create_analytics()
##        n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1                1.00              0.65
## n >= 2                1.00              0.65
## n >= 3                0.73              0.79
## n >= 4                0.46              0.86
#CONFUSION MATRIX
yhat = as.matrix(analytics@document_summary$CONSENSUS_CODE)
y = flag[(n/2+1):n]
print(table(y,yhat))
##    yhat
## y     0   1
##   0 238  21
##   1 154  87
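
As a rough check on the table above, the usual summary measures can be read off the confusion matrix directly. The sketch below retypes the four printed counts by hand (it does not touch the `analytics` object) and treats class 1 as the "positive" class, which is an assumption made only for illustration.

```r
# Rebuild the printed confusion matrix: rows = actual y, columns = predicted yhat.
# matrix() fills column by column, so the first column is (238, 154).
cm = matrix(c(238, 154, 21, 87), nrow = 2,
            dimnames = list(y = c("0", "1"), yhat = c("0", "1")))

accuracy  = sum(diag(cm)) / sum(cm)          # share of all documents classified correctly
precision = cm["1", "1"] / sum(cm[, "1"])    # of those predicted 1, share truly 1
recall    = cm["1", "1"] / sum(cm["1", ])    # of those truly 1, share predicted 1

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
# accuracy 0.650, precision 0.806, recall 0.361
```

The accuracy of 0.65 agrees with the first row of the ensemble summary, while the low recall for class 1 shows that the consensus code misses most of the flagged documents.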

Grading Text

In recent years, the SAT added an essay section. While the essay was aimed at assessing original writing, it also brought automated grading into use. One goal of the test is to assess the student's writing level, which is closely related to the notion of readability.

Readability

“Readability” is a metric of how easy it is to comprehend text. Given a goal of efficient markets, regulators want to foster transparency by making sure financial documents that are disseminated to the investing public are readable. Hence, metrics for readability are very important and are recently gaining traction.

Gunning-Fog Index

Gunning (1952) developed the Fog index. The index estimates the years of formal education needed to understand text on a first reading. A fog index of 12 requires the reading level of a U.S. high school senior (around 18 years old). The index is based on the idea that poor readability is associated with longer sentences and complex words. Complex words are those that have more than two syllables. The formula for the Fog index is

\[ 0.4 \cdot \left[\frac{\mbox{\#words}}{\mbox{\#sentences}} + 100 \cdot \left( \frac{\mbox{\#complex words}}{\mbox{\#words}} \right) \right] \]
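
As a minimal sketch, the formula can be wrapped in a small R function that takes raw counts. The function name fog_index, its argument names, and the counts in the example are hypothetical; counting syllables to identify complex words is left to a tokenizer.

```r
# Gunning-Fog index from raw counts (function and argument names are illustrative)
fog_index = function(n_words, n_sentences, n_complex) {
  0.4 * (n_words/n_sentences + 100 * n_complex/n_words)
}

# Hypothetical passage: 100 words, 5 sentences, 10 words of three or more syllables
print(fog_index(100, 5, 10))
```

For this made-up passage the index is 0.4 * (20 + 10) = 12, the high school senior level mentioned above.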

Alternative readability scores use similar ideas. The Flesch Reading Ease Score and the Flesch-Kincaid Grade Level also use counts of words, syllables, and sentences. See http://en.wikipedia.org/wiki/Flesch-Kincaid_readability_tests. The Flesch Reading Ease Score is defined as

\[ 206.835 - 1.015 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) - 84.6 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) \]

Scores in the range 90-100 indicate text that is easily understood by an 11-year-old, scores of 60-70 indicate text that is easy for 13-15 year olds, and scores of 0-30 indicate text best understood by university graduates.

The Flesch-Kincaid Grade Level

This is defined as

\[ 0.39 \left(\frac{\mbox{\#words}}{\mbox{\#sentences}}\right) + 11.8 \left( \frac{\mbox{\#syllables}}{\mbox{\#words}} \right) -15.59 \]

which gives a number that corresponds to the grade level. As expected, the Reading Ease score and the Grade Level are negatively correlated, since easier text scores high on the former and low on the latter. Various other measures of readability use the same ideas as the Fog index. For example, the Coleman and Liau (1975) index does not even require a count of syllables, as follows:

\[ CLI = 0.0588 L - 0.296 S - 15.8 \]

where \(L\) is the average number of letters per hundred words and \(S\) is the average number of sentences per hundred words.
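
The three formulas above (Flesch Reading Ease, Flesch-Kincaid Grade Level, and Coleman-Liau) can be sketched as plain R functions of raw counts. The function names and the two passages below are hypothetical and serve only to show how the scores move in opposite directions.

```r
# Readability formulas as functions of raw counts (names are illustrative)
flesch_reading_ease = function(n_words, n_sentences, n_syllables) {
  206.835 - 1.015*(n_words/n_sentences) - 84.6*(n_syllables/n_words)
}
flesch_kincaid_grade = function(n_words, n_sentences, n_syllables) {
  0.39*(n_words/n_sentences) + 11.8*(n_syllables/n_words) - 15.59
}
coleman_liau = function(n_letters, n_words, n_sentences) {
  L = 100 * n_letters/n_words      # letters per hundred words
  S = 100 * n_sentences/n_words    # sentences per hundred words
  0.0588*L - 0.296*S - 15.8
}

# Two hypothetical 100-word passages: short sentences and short words
# versus long sentences and long words
scores = rbind(
  easy = c(FRE = flesch_reading_ease(100, 10, 130),
           FK  = flesch_kincaid_grade(100, 10, 130),
           CLI = coleman_liau(420, 100, 10)),
  hard = c(FRE = flesch_reading_ease(100, 4, 170),
           FK  = flesch_kincaid_grade(100, 4, 170),
           CLI = coleman_liau(520, 100, 4))
)
print(round(scores, 2))
```

The easy passage gets the higher Reading Ease score and the lower grade levels, which is the negative relationship noted above.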

Standard readability metrics may not work well for financial text. Loughran and McDonald (2014) find that the Fog index is inferior to simply looking at 10-K file size.

References

M. Coleman and T. L. Liau (1975). A computer readability formula designed for machine scoring. Journal of Applied Psychology 60, 283-284.

T. Loughran and W. McDonald (2014). Measuring readability in financial disclosures. The Journal of Finance 69, 1643-1671.

The koRpus package

The R package koRpus provides readability scoring. See http://www.inside-r.org/packages/cran/koRpus/docs/readability

First, let’s grab some text from my web site.

library(rvest)
url = "http://srdas.github.io/bio-candid.html"

doc.html = read_html(url)
text = doc.html %>% html_nodes("p") %>% html_text()

text = gsub("[\t\n]"," ",text)
text = gsub('"'," ",text)   #replaces double quotes with spaces
text = paste(text, collapse=" ")
print(text)
## [1] " Sanjiv Das: A Short Academic Life History    After loafing and working in many parts of Asia, but never really growing up, Sanjiv moved to New York to change the world, hopefully through research.  He graduated in 1994 with a Ph.D. from NYU, and since then spent five years in Boston, and now lives in San Jose, California.  Sanjiv loves animals, places in the world where the mountains meet the sea, riding sport motorbikes, reading, gadgets, science fiction movies, and writing cool software code. When there is time available from the excitement of daily life, Sanjiv writes academic papers, which helps him relax. Always the contrarian, Sanjiv thinks that New York City is the most calming place in the world, after California of course.     Sanjiv is now a Professor of Finance at Santa Clara University. He came to SCU from Harvard Business School and spent a year at UC Berkeley. In his past life in the unreal world, Sanjiv worked at Citibank, N.A. in the Asia-Pacific region. He takes great pleasure in merging his many previous lives into his current existence, which is incredibly confused and diverse.     Sanjiv's research style is instilled with a distinct  New York state of mind  - it is chaotic, diverse, with minimal method to the madness. He has published articles on derivatives, term-structure models, mutual funds, the internet, portfolio choice, banking models, credit risk, and has unpublished articles in many other areas. Some years ago, he took time off to get another degree in computer science at Berkeley, confirming that an unchecked hobby can quickly become an obsession. There he learnt about the fascinating field of Randomized Algorithms, skills he now applies earnestly to his editorial work, and other pursuits, many of which stem from being in the epicenter of Silicon Valley.     Coastal living did a lot to mold Sanjiv, who needs to live near the ocean.  
The many walks in Greenwich village convinced him that there is no such thing as a representative investor, yet added many unique features to his personal utility function. He learnt that it is important to open the academic door to the ivory tower and let the world in. Academia is a real challenge, given that he has to reconcile many more opinions than ideas. He has been known to have turned down many offers from Mad magazine to publish his academic work. As he often explains, you never really finish your education -  you can check out any time you like, but you can never leave.  Which is why he is doomed to a lifetime in Hotel California. And he believes that, if this is as bad as it gets, life is really pretty good.    "

Now we can assess it for readability.

library(koRpus)
## Warning: package 'koRpus' was built under R version 3.2.5
## 
## Attaching package: 'koRpus'
## The following object is masked from 'package:dplyr':
## 
##     query
## The following object is masked from 'package:qdap':
## 
##     SMOG
## The following object is masked from 'package:lsa':
## 
##     query
write(text,file="textvec.txt")
text_tokens = tokenize("textvec.txt",lang="en")
#print(text_tokens)
print(c("Number of sentences: ",text_tokens@desc$sentences))
## [1] "Number of sentences: " "24"
print(c("Number of words: ",text_tokens@desc$words))
## [1] "Number of words: " "446"
print(c("Number of words per sentence: ",text_tokens@desc$avg.sentc.length))
## [1] "Number of words per sentence: " "18.5833333333333"
print(c("Average length of words: ",text_tokens@desc$avg.word.length))
## [1] "Average length of words: " "4.67488789237668"

Next we generate several indices of readability, which are worth looking at.

print(readability(text_tokens))
## Hyphenation (language: en)
## 
## Warning: Bormuth: Missing word list, hence not calculated.
## Warning: Coleman: POS tags are not elaborate enough, can't count pronouns
## and prepositions. Formulae skipped.
## Warning: Dale-Chall: Missing word list, hence not calculated.
## Warning: DRP: Missing Bormuth Mean Cloze, hence not calculated.
## Warning: Harris.Jacobson: Missing word list, hence not calculated.
## Warning: Spache: Missing word list, hence not calculated.
## Warning: Traenkle.Bailer: POS tags are not elaborate enough, can't count
## prepositions and conjuctions. Formulae skipped.
## Warning: Note: The implementations of these formulas are still subject to validation:
##   Coleman, Danielson.Bryan, Dickes.Steiwer, ELF, Fucks, Harris.Jacobson, nWS, Strain, Traenkle.Bailer, TRI
##   Use the results with caution, even if they seem plausible!
## 
## Automated Readability Index (ARI)
##   Parameters: default 
##        Grade: 9.88 
## 
## 
## Coleman-Liau
##   Parameters: default 
##          ECP: 47% (estimted cloze percentage)
##        Grade: 10.09 
##        Grade: 10.1 (short formula)
## 
## 
## Danielson-Bryan
##   Parameters: default 
##          DB1: 7.64 
##          DB2: 48.58 
##        Grade: 9-12 
## 
## 
## Dickes-Steiwer's Handformel
##   Parameters: default 
##          TTR: 0.58 
##        Score: 42.76 
## 
## 
## Easy Listening Formula
##   Parameters: default 
##       Exsyls: 149 
##        Score: 6.21 
## 
## 
## Farr-Jenkins-Paterson
##   Parameters: default 
##           RE: 56.1 
##        Grade: >= 10 (high school) 
## 
## 
## Flesch Reading Ease
##   Parameters: en (Flesch) 
##           RE: 59.75 
##        Grade: >= 10 (high school) 
## 
## 
## Flesch-Kincaid Grade Level
##   Parameters: default 
##        Grade: 9.54 
##          Age: 14.54 
## 
## 
## Gunning Frequency of Gobbledygook (FOG)
##   Parameters: default 
##        Grade: 12.55 
## 
## 
## FORCAST
##   Parameters: default 
##        Grade: 10.01 
##          Age: 15.01 
## 
## 
## Fucks' Stilcharakteristik
##        Score: 86.88 
##        Grade: 9.32 
## 
## 
## Linsear Write
##   Parameters: default 
##   Easy words: 87 
##   Hard words: 13 
##        Grade: 11.71 
## 
## 
## Läsbarhetsindex (LIX)
##   Parameters: default 
##        Index: 40.56 
##       Rating: standard 
##        Grade: 6 
## 
## 
## Neue Wiener Sachtextformeln
##   Parameters: default 
##        nWS 1: 5.42 
##        nWS 2: 5.97 
##        nWS 3: 6.28 
##        nWS 4: 6.81 
## 
## 
## Readability Index (RIX)
##   Parameters: default 
##        Index: 4.08 
##        Grade: 9 
## 
## 
## Simple Measure of Gobbledygook (SMOG)
##   Parameters: default 
##        Grade: 12.01 
##          Age: 17.01 
## 
## 
## Strain Index
##   Parameters: default 
##        Index: 8.45 
## 
## 
## Kuntzsch's Text-Redundanz-Index
##   Parameters: default 
##  Short words: 297 
##  Punctuation: 71 
##      Foreign: 0 
##        Score: -56.22 
## 
## 
## Tuldava's Text Difficulty Formula
##   Parameters: default 
##        Index: 4.43 
## 
## 
## Wheeler-Smith
##   Parameters: default 
##        Score: 62.08 
##        Grade: > 4 
## 
## Text language: en

Text Summarization

It is really easy to write a summarizer in a few lines of code. The function below takes in a text array and returns a summary. Each element of the array is one sentence of the document we want summarized.

In the function we need to calculate how similar each sentence is to any other one. This could be done using cosine similarity, but here we use another approach, Jaccard similarity. Given two sentences, Jaccard similarity is the ratio of the size of the intersection word set divided by the size of the union set.

Jaccard Similarity

A document \(D\) is comprised of \(m\) sentences \(s_i, i=1,2,...,m\), where each \(s_i\) is a set of words. We compute the pairwise overlap between sentences using the Jaccard similarity index:

\[ J_{ij} = J(s_i, s_j) = \frac{|s_i \cap s_j|}{|s_i \cup s_j|} = J_{ji} \]

The overlap is the size of the intersection of the two word sets in sentences \(s_i\) and \(s_j\), divided by the size of the union of the two sets. The similarity score of each sentence is computed as the corresponding row sum of the Jaccard similarity matrix.

\[ {\cal S}_i = \sum_{j=1}^m J_{ij} \]
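As a small worked example of the two formulas above, the following sketch computes the Jaccard matrix and the row-sum scores for three made-up sentences. The sentences and the helper name `jaccard_sim` are assumptions for illustration only.

```r
# Hypothetical helper: Jaccard similarity between two sentences,
# treating each sentence as a set of space-separated words.
jaccard_sim = function(a, b) {
  aa = unique(unlist(strsplit(a, " ")))
  bb = unique(unlist(strsplit(b, " ")))
  length(intersect(aa, bb)) / length(union(aa, bb))
}

# Made-up sentences, for illustration only
sents = c("big data is big", "data is valuable", "text is data too")
m = length(sents)
J = matrix(0, m, m)
for (i in 1:m) {
  for (j in 1:m) {
    J[i, j] = jaccard_sim(sents[i], sents[j])
  }
}
S = rowSums(J)   # Similarity score of each sentence
print(round(J, 2))
print(round(S, 2))
```

Note that each row sum includes the diagonal term \(J_{ii} = 1\). This adds the same constant to every sentence, so it does not affect the ranking.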

Generating the summary

Once the row sums are obtained, they are sorted in decreasing order, and the summary consists of the \(n\) sentences with the highest \({\cal S}_i\) values.

# FUNCTION TO RETURN n SENTENCE SUMMARY
# Input: array of sentences (text)
# Output: the n sentences with the highest total Jaccard similarity to all sentences
text_summary = function(text, n) {
  m = length(text)  # No of sentences in input
  jaccard = matrix(0,m,m)  #Store match index
  for (i in 1:m) {
    for (j in i:m) {
      a = text[i]; aa = unlist(strsplit(a," "))
      b = text[j]; bb = unlist(strsplit(b," "))
      jaccard[i,j] = length(intersect(aa,bb))/
                          length(union(aa,bb))
      jaccard[j,i] = jaccard[i,j]
    }
  }
  similarity_score = rowSums(jaccard)
  res = sort(similarity_score, index.return=TRUE,
          decreasing=TRUE)
  idx = res$ix[1:n]
  summary = text[idx]
  return(summary)  # Return the selected sentences explicitly
}

Example: Summarization

We will use a sample of text that I took from Bloomberg news. It is about the need for data scientists.

url = "data_files/dstext_sample.txt"   #You can put any text file or URL here
text = read_web_page(url,cstem=0,cstop=0,ccase=0,cpunc=0,cflat=1)
print(length(text[[1]]))
## [1] 1
print("ORIGINAL TEXT")
## [1] "ORIGINAL TEXT"
print(text)
## [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver.  Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data. Data scientists were meant to be the answer to this issue. Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data. This has created a huge market for people with these skills. US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer. And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data.  It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%.  However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists. May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets. 
The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole. This theme of centralized vs. decentralized decision-making is one that has long been debated in the management literature.  For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience. Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs.   But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets. Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself. Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear.  He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’. But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves. One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan. They reviewed the workings of large US organisations over fifteen years from the mid-80s. What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level. 
Their research indicated that decentralisation pays. And technological advancement often goes hand-in-hand with decentralization. Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer. Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources. They can do it themselves, in just minutes.  The decentralization trend is now impacting on technology spending. According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling. Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable. Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands. But this approach is not necessarily always adopted. For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e. using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget. The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e. how do people actually deliver value from data assets. Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative. As ever then, the real value from data comes from asking the right questions of the data. 
And the right questions to ask only emerge if you are close enough to the business to see them. Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data. Which probably means that data scientists’ salaries will need to take a hit in the process."
text2 = strsplit(text,". ",fixed=TRUE)  #Special handling of the period.
text2 = text2[[1]]
print("SENTENCES")
## [1] "SENTENCES"
print(text2)
##  [1] "THERE HAVE BEEN murmurings that we are now in the “trough of disillusionment” of big data, the hype around it having surpassed the reality of what it can deliver"                                                                                                                                                     
##  [2] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
##  [3] "Data scientists were meant to be the answer to this issue"                                                                                                                                                                                                                                                             
##  [4] "Indeed, Hal Varian, Chief Economist at Google famously joked that “The sexy job in the next 10 years will be statisticians.” He was clearly right as we are now used to hearing that data scientists are the key to unlocking the value of big data"                                                                   
##  [5] "This has created a huge market for people with these skills"                                                                                                                                                                                                                                                           
##  [6] "US recruitment agency, Glassdoor, report that the average salary for a data scientist is $118,709 versus $64,537 for a skilled programmer"                                                                                                                                                                             
##  [7] "And a McKinsey study predicts that by 2018, the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and a 1.5 million shortage of managers with the skills to understand and make decisions based on analysis of big data"                                                     
##  [8] " It’s no wonder that companies are keen to employ data scientists when, for example, a retailer using big data can reportedly increase their margin by more than 60%"                                                                                                                                                  
##  [9] " However, is it really this simple? Can data scientists actually justify earning their salaries when brands seem to be struggling to realize the promise of big data? Perhaps we are expecting too much of data scientists"                                                                                            
## [10] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
## [11] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
## [12] "This theme of centralized vs"                                                                                                                                                                                                                                                                                          
## [13] "decentralized decision-making is one that has long been debated in the management literature"                                                                                                                                                                                                                          
## [14] " For many organisations a centralized structure helps maintain control over a vast international operation, plus ensures consistency of customer experience"                                                                                                                                                           
## [15] "Others, meanwhile, may give managers at a local level decision-making power particularly when it comes to tactical needs"                                                                                                                                                                                              
## [16] "  But the issue urgently needs revisiting in the context of big data as the way in which organisations manage themselves around data may well be a key factor for brands in realizing the value of their data assets"                                                                                                  
## [17] "Economist and philosopher Friedrich Hayek took the view that organisations should consider the purpose of the information itself"                                                                                                                                                                                      
## [18] "Centralized decision-making can be more cost-effective and co-ordinated, he believed, but decentralization can add speed and local information that proves more valuable, even if the bigger picture is less clear"                                                                                                    
## [19] " He argued that organisations thought too highly of centralized knowledge, while ignoring ‘knowledge of the particular circumstances of time and place’"                                                                                                                                                               
## [20] "But it is only relatively recently that economists are starting to accumulate data that allows them to gauge how successful organisations organize themselves"                                                                                                                                                         
## [21] "One such exercise reported by Tim Harford was carried out by Harvard Professor Julie Wulf and the former chief economist of the International Monetary Fund, Raghuram Rajan"                                                                                                                                           
## [22] "They reviewed the workings of large US organisations over fifteen years from the mid-80s"                                                                                                                                                                                                                              
## [23] "What they found was successful companies were often associated with a move towards decentralisation, often driven by globalisation and the need to react promptly to a diverse and swiftly-moving range of markets, particularly at a local level"                                                                     
## [24] "Their research indicated that decentralisation pays"                                                                                                                                                                                                                                                                   
## [25] "And technological advancement often goes hand-in-hand with decentralization"                                                                                                                                                                                                                                           
## [26] "Data analytics is starting to filter down to the department layer, where executives are increasingly eager to trawl through the mass of information on offer"                                                                                                                                                          
## [27] "Cloud computing, meanwhile, means that line managers no longer rely on IT teams to deploy computer resources"                                                                                                                                                                                                          
## [28] "They can do it themselves, in just minutes"                                                                                                                                                                                                                                                                            
## [29] " The decentralization trend is now impacting on technology spending"                                                                                                                                                                                                                                                   
## [30] "According to Gartner, chief marketing officers have been given the same purchasing power in this area as IT managers and, as their spending rises, so that of data centre managers is falling"                                                                                                                         
## [31] "Tim Harford makes a strong case for the way in which this decentralization is important given that the environment in which we operate is so unpredictable"                                                                                                                                                            
## [32] "Innovation typically comes, he argues from a “swirling mix of ideas not from isolated minds.” And he cites Jane Jacobs, writer on urban planning– who suggested we find innovation in cities rather than on the Pacific islands"                                                                                       
## [33] "But this approach is not necessarily always adopted"                                                                                                                                                                                                                                                                   
## [34] "For example, research by academics Donald Marchand and Joe Peppard discovered that there was still a tendency for brands to approach big data projects the same way they would existing IT projects: i.e"                                                                                                              
## [35] "using centralized IT specialists with a focus on building and deploying technology on time, to plan, and within budget"                                                                                                                                                                                                
## [36] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
## [37] "how do people actually deliver value from data assets"                                                                                                                                                                                                                                                                 
## [38] "Marchand and Peppard suggest (among other recommendations) that those who need to be able to create meaning from data should be at the heart of any initiative"                                                                                                                                                        
## [39] "As ever then, the real value from data comes from asking the right questions of the data"                                                                                                                                                                                                                              
## [40] "And the right questions to ask only emerge if you are close enough to the business to see them"                                                                                                                                                                                                                        
## [41] "Are data scientists earning their salary? In my view they are a necessary but not sufficient part of the solution; brands need to be making greater investment in working with a greater range of users to help them ask questions of the data"                                                                        
## [42] "Which probably means that data scientists’ salaries will need to take a hit in the process."
print("SUMMARY")
## [1] "SUMMARY"
res = text_summary(text2,5)
print(res)
## [1] " Gartner suggested that the “gravitational pull of big data is now so strong that even people who haven’t a clue as to what it’s all about report that they’re running big data projects.”  Indeed, their research with business decision makers suggests that organisations are struggling to get value from big data"
## [2] "The focus on the data scientist often implies a centralized approach to analytics and decision making; we implicitly assume that a small team of highly skilled individuals can meet the needs of the organisation as a whole"                                                                                         
## [3] "May be we are investing too much in a relatively small number of individuals rather than thinking about how we can design organisations to help us get the most from data assets"                                                                                                                                      
## [4] "The problem with a centralized ‘IT-style’ approach is that it ignores the human side of the process of considering how people create and use information i.e"                                                                                                                                                          
## [5] "Which probably means that data scientists’ salaries will need to take a hit in the process."

Text Mining Research in Finance

In this segment we survey text mining research in the field of finance.

  1. Lu, Chen, Chen, Hung, and Li (2010) categorize finance related textual content into three categories: (a) forums, blogs, and wikis; (b) news and research reports; and (c) content generated by firms.

  2. Sentiment and other information have been extracted from messages posted to stock message boards such as Yahoo!, Motley Fool, Silicon Investor, Raging Bull, etc.; see Tumarkin and Whitelaw (2001), Antweiler and Frank (2004), Antweiler and Frank (2005), Das, Martinez-Jerez and Tufano (2005), Das and Chen (2007).

  3. Other news sources: Lexis-Nexis, Factiva, Dow Jones News, etc., see Das, Martinez-Jerez and Tufano (2005); Boudoukh, Feldman, Kogan, Richardson (2012).

  4. The Heard on the Street column in the Wall Street Journal has been used in work by Tetlock (2007) and by Tetlock, Saar-Tsechansky and Macskassy (2008); see also the use of Wall Street Journal articles by Lu, Chen, Chen, Hung, and Li (2010).

  5. Thomson-Reuters NewsScope Sentiment Engine (RNSE), based on Infonic/Lexalytics algorithms and varied data on stocks and text from internal databases; see Leinweber and Sisk (2011). Zhang and Skiena (2010) develop a market neutral trading strategy using news media such as tweets, over 500 newspapers, Spinn3r RSS feeds, and LiveJournal.

Das and Chen (Management Science 2007)

Using Twitter and Facebook for Market Prediction

  1. Bollen, Mao, and Zeng (2010) claimed that stock direction of the Dow Jones Industrial Average can be predicted using tweets with 87.6% accuracy.

  2. Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011) attempt to predict stock direction using tweets by detecting and overweighting the opinion of expert investors.

  3. Brown (2012) looks at the correlation between tweets and the stock market via several measures.

  4. Logunov (2011) uses OpinionFinder to generate many measures of sentiment from tweets.

  5. Twitter based sentiment developed by Rao and Srivastava (2012) is found to be highly correlated with stock prices and indexes, as high as 0.88 for returns.

  6. Sprenger and Welpe (2010) find that tweet bullishness is associated with abnormal stock returns and tweet volume predicts trading volume.

Polarity and Subjectivity

Zhang and Skiena (2010) use Twitter feeds and also three other sources of text: over 500 nationwide newspapers, RSS feeds from blogs, and LiveJournal blogs. These are used to compute two metrics.

\[ \begin{eqnarray*} \mbox{polarity} &=& \frac{n_{pos} - n_{neg}}{n_{pos} + n_{neg}} \\ \mbox{subjectivity} &=& \frac{n_{pos} + n_{neg}}{N} \end{eqnarray*} \]

where \(N\) is the total number of words in a text document, \(n_{pos}, n_{neg}\) are the number of positive and negative words, respectively.
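The two metrics can be sketched in a few lines of R. This is a minimal illustration only: the word lists and the sample document are made up, and a real application would use a proper sentiment lexicon.

```r
# Toy word lists: assumptions for illustration, not a real sentiment lexicon
pos_words = c("gain", "strong", "beat")
neg_words = c("loss", "weak", "miss")

# Made-up document
doc = "strong quarter with a revenue gain but a weak outlook"
words = unlist(strsplit(tolower(doc), " "))

N = length(words)                   # Total number of words
n_pos = sum(words %in% pos_words)   # Number of positive words
n_neg = sum(words %in% neg_words)   # Number of negative words

polarity = (n_pos - n_neg) / (n_pos + n_neg)
subjectivity = (n_pos + n_neg) / N
print(c(polarity = polarity, subjectivity = subjectivity))
```

Polarity is undefined when a document contains no positive or negative words, so such documents need separate handling.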

Logunov (2011) uses tweet data, applies OpinionFinder, and also develops a new classifier called Naive Emoticon Classification to encode sentiment. This is an unusual and original, albeit quite intuitive, use of emoticons to determine mood in text mining. If an emoticon exists, then the tweet is automatically coded with the corresponding sentiment. Four types of emoticons are considered: Happy (H), Sad (S), Joy (J), and Cry (C). Polarity is defined here as \[ \mbox{polarity} = A = \frac{n_H + n_J}{n_H + n_S + n_J + n_C} \] Values greater than 0.5 are positive. \(A\) stands for aggregate sentiment and appears to be strongly autocorrelated. Overall, prediction evidence is weak.
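A rough sketch of the emoticon-based polarity follows. The tweets are made up, and the mapping of specific emoticons to the four categories is an assumption for illustration; it is not taken from Logunov (2011).

```r
# Made-up tweets
tweets = c("great earnings :)", "stock tanked :(", "to the moon :D",
           "lost it all :'(", "nice rally :)", "no emoticon here")

# Hypothetical helper: number of tweets containing a given emoticon
count_emo = function(pattern, x) sum(grepl(pattern, x, fixed = TRUE))

# Assumed mapping of emoticons to categories
n_H = count_emo(":)", tweets)    # Happy
n_S = count_emo(":(", tweets)    # Sad
n_J = count_emo(":D", tweets)    # Joy
n_C = count_emo(":'(", tweets)   # Cry

A = (n_H + n_J) / (n_H + n_S + n_J + n_C)
print(A)   # Values above 0.5 indicate positive aggregate sentiment
```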

Text Mining Corporate Reports

There is a proliferation of word-weighting schemes. One common idea is to use the “inverse document frequency” (\(idf\)) as a weighting coefficient. The \(idf\) for word \(j\) is

\[ w_j^{idf} = \ln \left( \frac{N}{df_j} \right) \] where \(N\) is the total number of documents, and \(df_j\) is the number of documents containing word \(j\). This scheme is described in Manning and Schutze (1999).
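The \(idf\) weights are straightforward to compute directly. The sketch below uses a made-up corpus of four short documents; the variable names are illustrative assumptions.

```r
# Made-up corpus of four short documents
docs = c("profit rose on strong sales",
         "profit fell on weak sales",
         "litigation risk rose",
         "sales guidance unchanged")
N = length(docs)   # Total number of documents

# Unique words per document, so each document counts a word at most once
doc_words = lapply(strsplit(docs, " "), unique)
vocab = sort(unique(unlist(doc_words)))

# df_j: number of documents containing word j
df = sapply(vocab, function(w) sum(sapply(doc_words, function(d) w %in% d)))

idf = log(N / df)   # log() is the natural logarithm in R
print(round(idf, 3))
```

Words that appear in most documents (here, "sales") receive a low weight, while words confined to a single document (here, "litigation") receive the highest weight.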

Tone

Using the MD&A

Readability of Financial Reports

Corporate Finance and Risk Management

  1. Sprenger (2011) integrates data from text classification of tweets, user voting, and a proprietary stock game to extract the bullishness of online investors; these ideas are behind the site http://TweetTrader.net.

  2. Tweets also pose interesting problems of big streaming data discussed in Pervin, Fang, Datta, and Dutta (2013).

  3. Data used here is from filings such as 10-Ks, etc., (Loughran and McDonald (2011); Burdick et al (2011); Bodnaruk, Loughran, and McDonald (2013); Jegadeesh and Wu (2013); Loughran and McDonald (2014)).

Predicting Markets

  1. Wysocki (1999) found that for the 50 top firms in message posting volume on Yahoo! Finance, message volume predicted next day abnormal stock returns. Using a broader set of firms, he also found that high message volume firms were those with inflated valuations (relative to fundamentals), high trading volume, high short seller activity (given possibly inflated valuations), high analyst following (message posting appears to be related to news as well, correlated with a general notion of “attention” stocks), and low institutional holdings (hence broader investor discussion and interest), all intuitive outcomes.

  2. Bagnoli, Beneish, and Watts (1999) examined earnings “whispers”, unofficial crowd-sourced forecasts of quarterly earnings from small investors, and found them to be more accurate than First Call analyst forecasts.

  3. Tumarkin and Whitelaw (2001) examined self-reported sentiment on the Raging Bull message board and found no predictive content, either of returns or volume.

Bullishness Index

Antweiler and Frank (2004) used the Naive Bayes algorithm for classification, implemented in the Rainbow package of Andrew McCallum (1996). They also repeated the same using Support Vector Machines (SVMs) as a robustness check. Both algorithms generate similar empirical results. Once the algorithm is trained, they use it out-of-sample to sign each message as \(\{Buy, Hold, Sell\}\). Let \(n_B, n_S\) be the number of buy and sell messages, respectively. Then \(R = n_B/n_S\) is just the ratio of buy to sell messages. Based on this they define their bullishness index

\[ B = \frac{n_B - n_S}{n_B + n_S} = \frac{R-1}{R+1} \in (-1,+1) \]

This metric is independent of the number of messages, i.e., is homogeneous of degree zero in \(n_B,n_S\). An alternative measure is also proposed, i.e.,

\[ \begin{eqnarray*} B^* &=& \ln\left[\frac{1+n_B}{1+n_S} \right] \\ &=& \ln\left[\frac{1+R(1+n_B+n_S)}{1+R+n_B+n_S} \right] \\ &=& \ln\left[\frac{2+(n_B+n_S)(1+B)}{2+(n_B+n_S)(1-B)} \right] \\ & \approx & B \cdot \ln(1+n_B+n_S) \end{eqnarray*} \]

This measure takes the bullishness index \(B\) and weights it by the number of messages of both categories. It is homogeneous of degree between zero and one. They also propose a third measure, which is much more direct, i.e.,

\[ B^{**} = n_B - n_S = (n_B+n_S) \cdot \frac{R-1}{R+1} = M \cdot B \]

which is homogeneous of degree one and is a message-weighted bullishness index, with \(M = n_B + n_S\) denoting the total number of buy and sell messages. They prefer to use \(B^*\) in their algorithms as it appears to deliver the best predictive results. Finally, they produce an agreement index,

\[ A = 1 - \sqrt{1-B^2} \in (0,1) \]

Note how closely this is related to the disagreement index seen earlier.
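To make the four measures concrete, here is a small numerical sketch. The message counts are hypothetical, and the variable names are chosen for illustration only.

```r
# Hypothetical message counts for one stock on one day
n_B = 30; n_S = 10
M = n_B + n_S        # Total number of buy and sell messages
R = n_B / n_S        # Ratio of buy to sell messages

B = (n_B - n_S) / (n_B + n_S)         # Bullishness index
B_star = log((1 + n_B) / (1 + n_S))   # Alternative measure B*
B_2star = n_B - n_S                   # Message-weighted measure B**
A = 1 - sqrt(1 - B^2)                 # Agreement index

print(round(c(B = B, B_star = B_star, B_2star = B_2star, A = A), 4))

# Homogeneity check: double both counts
B_dbl = (2*n_B - 2*n_S) / (2*n_B + 2*n_S)
B_star_dbl = log((1 + 2*n_B) / (1 + 2*n_S))
B_2star_dbl = 2*n_B - 2*n_S
print(round(c(B_dbl = B_dbl, B_star_dbl = B_star_dbl, B_2star_dbl = B_2star_dbl), 4))
```

Doubling both counts leaves \(B\) unchanged, doubles \(B^{**}\), and raises \(B^*\) by less than a factor of two, in line with the degrees of homogeneity stated above.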

Commercial Developments

IBM’s Midas System

Stock Twits

iSentium

RavenPack

Possible Applications for Finance Firms

An illustrative list of applications for finance firms is as follows:

Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is an approach for reducing the dimension of the Term-Document Matrix (TDM) or, equivalently, the Document-Term Matrix (DTM); the two terms are generally used interchangeably unless a specific one is invoked. Dimension reduction of the TDM offers two benefits:

How is LSA implemented using SVD?

LSA is the application of Singular Value Decomposition (SVD) to the TDM, extracted from a text corpus. Define the TDM to be a matrix \(M \in {\cal R}^{m \times n}\), where \(m\) is the number of terms and \(n\) is the number of documents.

The SVD of matrix \(M\) is given by \[ M = T \cdot S \cdot D^\top \] where \(T \in {\cal R}^{m \times n}\) and \(D \in {\cal R}^{n \times n}\) each have orthonormal columns, and \(S \in {\cal R}^{n \times n}\) is the “singular values” matrix, i.e., a diagonal matrix with singular values on the diagonal. These values denote the relative importance of the corresponding latent dimensions of the TDM.
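The decomposition can be checked numerically with the base R function `svd`. The sketch below uses a small made-up matrix with more terms than documents; the names `T_mat`, `S_mat`, and `D_mat` are illustrative stand-ins for \(T\), \(S\), and \(D\).

```r
# Made-up term-document matrix: 4 terms (rows) x 3 documents (columns)
M = matrix(c(2, 0, 1,
             1, 1, 0,
             0, 3, 1,
             1, 0, 2), nrow = 4, byrow = TRUE)

res = svd(M)          # Returns d (singular values), u, and v
T_mat = res$u         # 4 x 3, plays the role of T
S_mat = diag(res$d)   # 3 x 3 diagonal matrix of singular values
D_mat = res$v         # 3 x 3, plays the role of D

# Reconstruction error should be numerically zero
print(max(abs(M - T_mat %*% S_mat %*% t(D_mat))))

# Columns of T and D are orthonormal: both products are identity matrices
print(round(t(T_mat) %*% T_mat, 10))
print(round(t(D_mat) %*% D_mat, 10))
```

The singular values in `res$d` are returned in decreasing order, so truncating to the first few columns of `T_mat` and `D_mat` keeps the most important dimensions.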

Example

Create a temporary directory and add some documents to it. This is a modification of the example in the lsa package.

dir.create("D")   # portable alternative to a shell mkdir
write( c("blue", "red", "green"), file=file.path("D", "D1.txt"))
write( c("black", "blue", "red"), file=file.path("D", "D2.txt"))
write( c("yellow", "black", "green"), file=file.path("D", "D3.txt"))
write( c("yellow", "red", "black"), file=file.path("D", "D4.txt"))

Create a TDM using the textmatrix function.

library(lsa)
tdm = textmatrix("D",minWordLength=1)
print(tdm)
##         docs
## terms    D1.txt D2.txt D3.txt D4.txt
##   blue        1      1      0      0
##   green       1      0      1      0
##   red         1      1      0      1
##   black       0      1      1      1
##   yellow      0      0      1      1

Remove the extra directory.

unlink("D", recursive = TRUE)   # portable alternative to a shell rm -rf

So, what does SVD do?

SVD tries to connect the correlation matrix of terms (\(M \cdot M^\top\)) with the correlation matrix of documents (\(M^\top \cdot M\)) through the matrix of singular values.

To see this connection, note that matrix \(T\) contains the eigenvectors of the correlation matrix of terms. Likewise, the matrix \(D\) contains the eigenvectors of the correlation matrix of documents. To see this, let’s compute the eigenvectors of both matrices:

et = eigen(tdm %*% t(tdm))$vectors
print(et)
##            [,1]          [,2]        [,3]          [,4]       [,5]
## [1,] -0.3629044 -6.015010e-01 -0.06829369  3.717480e-01  0.6030227
## [2,] -0.3328695 -2.220446e-16 -0.89347008  5.551115e-16 -0.3015113
## [3,] -0.5593741 -3.717480e-01  0.31014767 -6.015010e-01 -0.3015113
## [4,] -0.5593741  3.717480e-01  0.31014767  6.015010e-01 -0.3015113
## [5,] -0.3629044  6.015010e-01 -0.06829369 -3.717480e-01  0.6030227
ed = eigen(t(tdm) %*% tdm)$vectors
print(ed)
##            [,1]      [,2]       [,3]      [,4]
## [1,] -0.4570561  0.601501 -0.5395366 -0.371748
## [2,] -0.5395366  0.371748  0.4570561  0.601501
## [3,] -0.4570561 -0.601501 -0.5395366  0.371748
## [4,] -0.5395366 -0.371748  0.4570561 -0.601501
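The link between the two eigen-decompositions and the SVD follows from \(M \cdot M^\top = T \cdot S^2 \cdot T^\top\) and \(M^\top \cdot M = D \cdot S^2 \cdot D^\top\): the nonzero eigenvalues are the squared singular values, and the eigenvectors match the singular vectors up to the sign of each column. A minimal check in base R, with the example TDM typed in by hand from the printed matrix so that the snippet does not need the lsa package:

```r
# Example TDM copied from the printed output (terms in rows, documents in columns)
tdm <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 1, 1, 1,
                0, 0, 1, 1),
              nrow = 5, byrow = TRUE,
              dimnames = list(c("blue", "green", "red", "black", "yellow"),
                              paste0("D", 1:4, ".txt")))

dec      <- svd(tdm)
ev_terms <- eigen(tcrossprod(tdm))   # tdm %*% t(tdm), 5 x 5
ev_docs  <- eigen(crossprod(tdm))    # t(tdm) %*% tdm, 4 x 4

# Eigenvalues of both matrices versus squared singular values
print(rbind(terms = ev_terms$values[1:4],
            docs  = ev_docs$values,
            svd   = dec$d^2))

# Eigenvectors agree with the singular vectors once signs are ignored
sign_free_gap <- max(abs(abs(ev_docs$vectors) - abs(dec$v)))
print(sign_free_gap)
```

The fifth eigenvalue on the term side is zero up to rounding, since a 5 by 4 matrix has rank at most 4.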

Dimension reduction of the TDM via LSA

If we wish to reduce the dimension of the latent semantic space to \(k < n\) then we use only the first \(k\) eigenvectors. The lsa function does this automatically.

We call LSA and ask it to automatically reduce the dimension of the TDM using a built-in function dimcalc_share.

res = lsa(tdm,dims=dimcalc_share())
print(res)
## $tk
##              [,1]          [,2]
## blue   -0.3629044 -6.015010e-01
## green  -0.3328695 -5.551115e-17
## red    -0.5593741 -3.717480e-01
## black  -0.5593741  3.717480e-01
## yellow -0.3629044  6.015010e-01
## 
## $dk
##              [,1]      [,2]
## D1.txt -0.4570561 -0.601501
## D2.txt -0.5395366 -0.371748
## D3.txt -0.4570561  0.601501
## D4.txt -0.5395366  0.371748
## 
## $sk
## [1] 2.746158 1.618034
## 
## attr(,"class")
## [1] "LSAspace"

We can see that the dimension has been reduced from \(n=4\) to \(k=2\). The output is shown for both the term matrix and the document matrix, both of which have only two columns. Think of these as the two “principal semantic components” of the TDM.

Compare the output of the LSA to the eigenvectors above to see that it matches their first two columns, up to the sign of each column (eigenvectors are only determined up to sign). The singular values in the output are connected to SVD as follows.

LSA and SVD: the connection?

First of all we see that the lsa function is nothing but the svd function in base R.

res2 = svd(tdm)
print(res2)
## $d
## [1] 2.746158 1.618034 1.207733 0.618034
## 
## $u
##            [,1]          [,2]        [,3]          [,4]
## [1,] -0.3629044 -6.015010e-01  0.06829369  3.717480e-01
## [2,] -0.3328695 -5.551115e-17  0.89347008 -3.455569e-15
## [3,] -0.5593741 -3.717480e-01 -0.31014767 -6.015010e-01
## [4,] -0.5593741  3.717480e-01 -0.31014767  6.015010e-01
## [5,] -0.3629044  6.015010e-01  0.06829369 -3.717480e-01
## 
## $v
##            [,1]      [,2]       [,3]      [,4]
## [1,] -0.4570561 -0.601501  0.5395366 -0.371748
## [2,] -0.5395366 -0.371748 -0.4570561  0.601501
## [3,] -0.4570561  0.601501  0.5395366  0.371748
## [4,] -0.5395366  0.371748 -0.4570561 -0.601501

The output here is the same as that of LSA except it is provided for \(n=4\). So we have four columns in \(T\) and \(D\) rather than two. Compare the results here to the previous two slides to see the connection.
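Why two dimensions? One way to rationalize the choice is a share rule: keep the smallest number of dimensions whose singular values account for a given share of the sum of all singular values. The sketch below is an illustration in that spirit, with a hypothetical helper choose_k and an assumed share of 0.5; it is not taken from the lsa source, so consult the package documentation for the exact definition of dimcalc_share.

```r
# Singular values of the example TDM, copied from the svd() output above
d <- c(2.746158, 1.618034, 1.207733, 0.618034)

# Hypothetical helper: smallest k whose singular values reach `share`
# of the total (the value 0.5 is an assumption made for this sketch)
choose_k <- function(d, share = 0.5) {
  cum_share <- cumsum(d) / sum(d)
  which(cum_share >= share)[1]
}

print(round(cumsum(d) / sum(d), 3))   # cumulative share by dimension
k <- choose_k(d)
print(k)
```

With these singular values the first dimension alone carries about 44% of the total and the first two about 71%, so a 50% share leads to \(k=2\), the same number of dimensions that the lsa call returned.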

What is the rank of the TDM?

We may reconstruct the TDM using the result of the LSA.

tdm_lsa = res$tk %*% diag(res$sk) %*% t(res$dk)
print(tdm_lsa)
##            D1.txt    D2.txt     D3.txt    D4.txt
## blue    1.0409089 0.8995016 -0.1299115 0.1758948
## green   0.4178005 0.4931970  0.4178005 0.4931970
## red     1.0639006 1.0524048  0.3402938 0.6051912
## black   0.3402938 0.6051912  1.0639006 1.0524048
## yellow -0.1299115 0.1758948  1.0409089 0.8995016

We see the new TDM after the LSA operation. It has non-integer frequency counts, but it may be treated in the same way as the original TDM. The document vectors populate a slightly different hyperspace.
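How far is the reconstructed matrix from the original? For a truncated SVD, the Frobenius norm of the difference equals the square root of the sum of the squared singular values that were dropped. A short base R check, using svd directly (rather than the lsa object) with the example TDM entered by hand:

```r
# Example TDM (terms in rows, documents in columns)
tdm <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                1, 1, 0, 1,
                0, 1, 1, 1,
                0, 0, 1, 1),
              nrow = 5, byrow = TRUE)

dec <- svd(tdm)
k   <- 2

# Rank-k reconstruction from the first k singular triplets
tdm_k <- dec$u[, 1:k] %*% diag(dec$d[1:k]) %*% t(dec$v[, 1:k])

# Approximation error versus the discarded singular values
err       <- norm(tdm - tdm_k, type = "F")
err_bound <- sqrt(sum(dec$d[-(1:k)]^2))
print(c(error = err, from_singular_values = err_bound))
```

The two numbers coincide, so the size of the discarded singular values tells us directly how much of the TDM is lost when only \(k\) dimensions are kept.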

LSA reduces the rank of the TDM \(M\), and hence of the correlation matrix of terms \(M \cdot M^\top\), to \(k=2\). Here we see the rank before and after LSA.

library(Matrix)
## Warning: package 'Matrix' was built under R version 3.2.5
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:qdap':
## 
##     %&%
print(rankMatrix(tdm))
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15
print(rankMatrix(tdm_lsa))
## [1] 2
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 1.110223e-15

Topic Analysis with Latent Dirichlet Allocation (LDA)

What does LDA have to do with LSA?

It is similar to LSA, in that it seeks to find the most related words and cluster them into topics. It uses a Bayesian approach to do this, but more on that later. Here, let’s just do an example to see how we might use the topicmodels package.

#Load the package
library(topicmodels)
## Warning: package 'topicmodels' was built under R version 3.2.5
#Load data on news articles from Associated Press
data(AssociatedPress)
print(dim(AssociatedPress))
## [1]  2246 10473

This is a large DTM (not TDM). It has more than 10,000 terms, and more than 2,000 documents. This is very large and LDA will take some time, so let’s run it on a subset of the documents.

dtm = AssociatedPress[1:100,]
dim(dtm)
## [1]   100 10473

Now we run LDA on this data set.

#Set parameters for Gibbs sampling
burnin = 4000
iter = 2000
thin = 500
seed = list(2003,5,63,100001,765)
nstart = 5
best = TRUE

#Number of topics
k = 5
#Run LDA
res <-LDA(dtm, k, method="Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin))

#Show topics
res.topics = as.matrix(topics(res))
print(res.topics)
##        [,1]
##   [1,]    5
##   [2,]    4
##   [3,]    5
##   [4,]    1
##   [5,]    1
##   [6,]    4
##   [7,]    2
##   [8,]    1
##   [9,]    5
##  [10,]    5
##  [11,]    5
##  [12,]    3
##  [13,]    1
##  [14,]    4
##  [15,]    2
##  [16,]    3
##  [17,]    1
##  [18,]    1
##  [19,]    2
##  [20,]    3
##  [21,]    5
##  [22,]    2
##  [23,]    2
##  [24,]    1
##  [25,]    2
##  [26,]    4
##  [27,]    4
##  [28,]    2
##  [29,]    4
##  [30,]    3
##  [31,]    2
##  [32,]    1
##  [33,]    4
##  [34,]    1
##  [35,]    5
##  [36,]    4
##  [37,]    1
##  [38,]    4
##  [39,]    4
##  [40,]    2
##  [41,]    2
##  [42,]    2
##  [43,]    1
##  [44,]    1
##  [45,]    5
##  [46,]    3
##  [47,]    2
##  [48,]    3
##  [49,]    1
##  [50,]    4
##  [51,]    1
##  [52,]    2
##  [53,]    3
##  [54,]    1
##  [55,]    3
##  [56,]    4
##  [57,]    4
##  [58,]    2
##  [59,]    5
##  [60,]    2
##  [61,]    2
##  [62,]    3
##  [63,]    2
##  [64,]    1
##  [65,]    2
##  [66,]    4
##  [67,]    5
##  [68,]    2
##  [69,]    4
##  [70,]    5
##  [71,]    5
##  [72,]    5
##  [73,]    2
##  [74,]    5
##  [75,]    2
##  [76,]    1
##  [77,]    1
##  [78,]    1
##  [79,]    3
##  [80,]    5
##  [81,]    1
##  [82,]    3
##  [83,]    5
##  [84,]    3
##  [85,]    3
##  [86,]    5
##  [87,]    2
##  [88,]    5
##  [89,]    2
##  [90,]    5
##  [91,]    3
##  [92,]    1
##  [93,]    1
##  [94,]    4
##  [95,]    3
##  [96,]    4
##  [97,]    4
##  [98,]    4
##  [99,]    5
## [100,]    5
#Show top terms
res.terms = as.matrix(terms(res,10))
print(res.terms)
##       Topic 1          Topic 2   Topic 3      Topic 4      Topic 5   
##  [1,] "i"              "percent" "new"        "soviet"     "police"  
##  [2,] "people"         "year"    "york"       "government" "central" 
##  [3,] "state"          "company" "expected"   "official"   "man"     
##  [4,] "years"          "last"    "states"     "two"        "monday"  
##  [5,] "bush"           "new"     "officials"  "union"      "friday"  
##  [6,] "president"      "bank"    "program"    "officials"  "city"    
##  [7,] "get"            "oil"     "california" "war"        "four"    
##  [8,] "told"           "prices"  "week"       "president"  "school"  
##  [9,] "administration" "report"  "air"        "world"      "high"    
## [10,] "dukakis"        "million" "help"       "leaders"    "national"
#Show topic probabilities
res.topicProbs = as.data.frame(res@gamma)
print(res.topicProbs)
##             V1         V2         V3         V4         V5
## 1   0.19169329 0.06070288 0.04472843 0.10223642 0.60063898
## 2   0.12149533 0.14330218 0.08099688 0.58255452 0.07165109
## 3   0.27213115 0.04262295 0.05901639 0.07868852 0.54754098
## 4   0.29571984 0.16731518 0.19844358 0.19455253 0.14396887
## 5   0.31896552 0.15517241 0.20689655 0.14655172 0.17241379
## 6   0.30360934 0.08492569 0.08492569 0.46284501 0.06369427
## 7   0.17050691 0.40092166 0.15668203 0.17050691 0.10138249
## 8   0.37142857 0.15238095 0.14285714 0.20000000 0.13333333
## 9   0.19298246 0.17543860 0.19298246 0.19298246 0.24561404
## 10  0.19879518 0.16265060 0.17469880 0.18674699 0.27710843
## 11  0.21212121 0.20202020 0.16161616 0.15151515 0.27272727
## 12  0.20143885 0.15827338 0.25899281 0.17985612 0.20143885
## 13  0.41395349 0.16279070 0.18139535 0.12558140 0.11627907
## 14  0.17948718 0.17948718 0.12820513 0.30769231 0.20512821
## 15  0.05135952 0.78247734 0.06344411 0.06042296 0.04229607
## 16  0.09770115 0.24712644 0.35632184 0.14942529 0.14942529
## 17  0.43103448 0.18103448 0.09051724 0.10775862 0.18965517
## 18  0.67857143 0.04591837 0.06377551 0.08418367 0.12755102
## 19  0.07083333 0.70000000 0.08750000 0.07500000 0.06666667
## 20  0.15196078 0.05637255 0.69117647 0.04656863 0.05392157
## 21  0.21782178 0.11881188 0.12871287 0.15841584 0.37623762
## 22  0.16666667 0.30000000 0.16666667 0.16666667 0.20000000
## 23  0.19298246 0.21052632 0.17543860 0.21052632 0.21052632
## 24  0.31775701 0.20560748 0.16822430 0.18691589 0.12149533
## 25  0.05121951 0.65121951 0.15365854 0.08536585 0.05853659
## 26  0.11740891 0.09311741 0.08502024 0.37246964 0.33198381
## 27  0.06583072 0.05956113 0.10658307 0.68338558 0.08463950
## 28  0.15068493 0.30136986 0.12328767 0.26027397 0.16438356
## 29  0.07860262 0.04148472 0.05676856 0.68995633 0.13318777
## 30  0.13968254 0.17142857 0.46031746 0.07936508 0.14920635
## 31  0.08405172 0.74784483 0.07112069 0.05172414 0.04525862
## 32  0.66137566 0.10846561 0.06349206 0.07407407 0.09259259
## 33  0.14655172 0.18103448 0.15517241 0.41379310 0.10344828
## 34  0.29605263 0.19736842 0.21052632 0.13157895 0.16447368
## 35  0.08080808 0.05050505 0.10437710 0.07070707 0.69360269
## 36  0.13333333 0.07878788 0.08484848 0.46666667 0.23636364
## 37  0.46202532 0.08227848 0.12974684 0.16139241 0.16455696
## 38  0.09442060 0.07296137 0.12017167 0.64377682 0.06866953
## 39  0.11764706 0.08359133 0.10526316 0.62538700 0.06811146
## 40  0.10869565 0.56521739 0.14492754 0.07246377 0.10869565
## 41  0.07671958 0.43650794 0.16137566 0.25396825 0.07142857
## 42  0.11445783 0.57831325 0.11445783 0.09036145 0.10240964
## 43  0.55793991 0.10944206 0.08798283 0.09442060 0.15021459
## 44  0.40939597 0.10067114 0.22818792 0.12751678 0.13422819
## 45  0.20000000 0.15121951 0.12682927 0.25853659 0.26341463
## 46  0.14828897 0.11406844 0.56653992 0.08365019 0.08745247
## 47  0.09929078 0.41134752 0.13475177 0.22695035 0.12765957
## 48  0.20129870 0.07467532 0.54870130 0.10714286 0.06818182
## 49  0.46800000 0.09600000 0.18400000 0.10400000 0.14800000
## 50  0.22955145 0.08179420 0.05013193 0.60158311 0.03693931
## 51  0.28368794 0.17730496 0.18439716 0.14893617 0.20567376
## 52  0.12977099 0.45801527 0.12977099 0.18320611 0.09923664
## 53  0.10507246 0.14492754 0.55072464 0.06884058 0.13043478
## 54  0.42647059 0.13725490 0.15196078 0.15686275 0.12745098
## 55  0.11881188 0.19801980 0.44554455 0.08910891 0.14851485
## 56  0.22857143 0.15714286 0.13571429 0.37142857 0.10714286
## 57  0.15294118 0.07058824 0.06117647 0.66823529 0.04705882
## 58  0.11494253 0.49425287 0.14367816 0.12068966 0.12643678
## 59  0.13278008 0.04979253 0.13692946 0.26556017 0.41493776
## 60  0.16666667 0.31666667 0.16666667 0.16666667 0.18333333
## 61  0.06796117 0.73786408 0.08090615 0.04854369 0.06472492
## 62  0.12680115 0.12968300 0.58213256 0.12103746 0.04034582
## 63  0.07902736 0.72948328 0.09118541 0.05471125 0.04559271
## 64  0.44285714 0.12142857 0.14285714 0.13214286 0.16071429
## 65  0.19540230 0.31034483 0.19540230 0.14942529 0.14942529
## 66  0.18518519 0.22222222 0.17037037 0.28888889 0.13333333
## 67  0.07024793 0.07851240 0.08677686 0.04545455 0.71900826
## 68  0.10181818 0.48000000 0.14909091 0.12727273 0.14181818
## 69  0.12307692 0.15384615 0.10000000 0.43076923 0.19230769
## 70  0.12745098 0.07352941 0.14215686 0.13235294 0.52450980
## 71  0.21582734 0.10791367 0.16546763 0.14388489 0.36690647
## 72  0.17560976 0.11219512 0.17073171 0.15609756 0.38536585
## 73  0.12280702 0.46198830 0.07602339 0.23976608 0.09941520
## 74  0.20535714 0.16964286 0.17857143 0.14285714 0.30357143
## 75  0.07567568 0.47027027 0.11891892 0.19459459 0.14054054
## 76  0.67310789 0.15619968 0.07407407 0.05152979 0.04508857
## 77  0.63834423 0.07189542 0.09150327 0.11546841 0.08278867
## 78  0.61504425 0.09292035 0.11946903 0.11504425 0.05752212
## 79  0.10971787 0.07523511 0.65830721 0.07210031 0.08463950
## 80  0.11111111 0.08666667 0.11111111 0.05777778 0.63333333
## 81  0.49681529 0.03821656 0.15286624 0.14437367 0.16772824
## 82  0.20111732 0.17318436 0.24022346 0.15642458 0.22905028
## 83  0.10731707 0.15609756 0.11219512 0.23902439 0.38536585
## 84  0.26016260 0.10569106 0.36585366 0.13008130 0.13821138
## 85  0.11525424 0.10508475 0.39322034 0.30508475 0.08135593
## 86  0.15454545 0.06060606 0.15757576 0.09696970 0.53030303
## 87  0.08301887 0.67924528 0.07924528 0.09433962 0.06415094
## 88  0.16666667 0.15972222 0.22916667 0.11805556 0.32638889
## 89  0.12389381 0.47787611 0.09734513 0.14159292 0.15929204
## 90  0.12389381 0.11061947 0.23008850 0.10176991 0.43362832
## 91  0.19724771 0.11009174 0.30275229 0.16972477 0.22018349
## 92  0.33854167 0.13541667 0.12500000 0.11458333 0.28645833
## 93  0.40131579 0.13815789 0.10526316 0.18421053 0.17105263
## 94  0.06930693 0.10231023 0.09240924 0.67656766 0.05940594
## 95  0.09130435 0.15000000 0.65434783 0.03043478 0.07391304
## 96  0.13370474 0.13091922 0.12256267 0.49303621 0.11977716
## 97  0.06709265 0.06070288 0.11501597 0.60383387 0.15335463
## 98  0.16438356 0.16438356 0.17808219 0.28767123 0.20547945
## 99  0.06274510 0.08235294 0.16470588 0.06666667 0.62352941
## 100 0.11627907 0.20465116 0.11162791 0.16744186 0.40000000
#Check that the topic probabilities for each document sum to one
print(rowSums(res.topicProbs))
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Note that the highest probability in each row assigns each document to a topic.
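This can be verified directly: taking the column with the largest probability in each row reproduces the topic assignments. A small sketch using only the first three rows of the probability table printed above (typed in by hand, so no fitted model is needed):

```r
# First three rows of the topic-probability matrix from the output above
gamma <- matrix(c(0.19169329, 0.06070288, 0.04472843, 0.10223642, 0.60063898,
                  0.12149533, 0.14330218, 0.08099688, 0.58255452, 0.07165109,
                  0.27213115, 0.04262295, 0.05901639, 0.07868852, 0.54754098),
                nrow = 3, byrow = TRUE)

# Index of the most probable topic in each row (document)
assigned <- apply(gamma, 1, which.max)
print(assigned)
```

The result is 5, 4, 5, which matches the first three entries of res.topics.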

Shallow Dive into LDA

Latent Dirichlet Allocation (LDA) was created by David Blei, Andrew Ng, and Michael Jordan in 2003; see their paper titled “Latent Dirichlet Allocation” in the Journal of Machine Learning Research, vol. 3, pp. 993–1022.

The simplest way to think about LDA is as a probability model that connects documents with words and topics. The components are:

Next, we connect the above objects to \(K\) topics, indexed by \(l\), i.e., \(t_l\). We will see that LDA is encapsulated in two matrices: Matrix \(A\) and Matrix \(B\).

Matrix \(A\): Connecting Documents with Topics

Matrix \(B\): Connecting Words with Topics

Distribution of Topics in a Document

\[ p(\theta | \alpha) = \frac{\Gamma(\sum_{l=1}^K \alpha_l)}{\prod_{l=1}^K \Gamma(\alpha_l)} \; \prod_{l=1}^K \theta_l^{\alpha_l - 1} \]

where \(\Gamma(\cdot)\) is the Gamma function.

  - LDA thus gets its name from the use of the Dirichlet distribution, embodied in Matrix \(A\). Since the topics are latent, it explains the rest of the nomenclature.
  - Given \(\theta\), we sample topics from matrix \(A\) with probability \(p(t | \theta)\).
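The Dirichlet density is easy to evaluate numerically. The sketch below codes the formula on the log scale with lgamma for numerical stability; the function name log_ddirichlet and the example values of \(\theta\) and \(\alpha\) are hypothetical.

```r
# Hypothetical helper: log of the Dirichlet density p(theta | alpha)
log_ddirichlet <- function(theta, alpha) {
  stopifnot(length(theta) == length(alpha),
            all(theta > 0),
            abs(sum(theta) - 1) < 1e-8)     # theta must lie on the simplex
  lgamma(sum(alpha)) - sum(lgamma(alpha)) + sum((alpha - 1) * log(theta))
}

# With alpha = (1, 1, 1) the density is flat over the simplex
flat <- exp(log_ddirichlet(c(0.2, 0.3, 0.5), c(1, 1, 1)))
print(flat)

# With alpha = (2, 1, 1) more weight goes to documents heavy in topic 1
tilted <- exp(log_ddirichlet(c(0.5, 0.25, 0.25), c(2, 1, 1)))
print(tilted)
```

For the flat case the density is \(\Gamma(3) = 2\) everywhere; in the second case it is \(\Gamma(4)/\Gamma(2) \times 0.5 = 3\).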

Distribution of Words and Topics for a Document

\[ p(\theta, {\bf t}, {\bf w}) = p(\theta | \alpha) \prod_{l=1}^K p(t_l | \theta) p(w_l | t_l) \]

\[ p({\bf w}) = \int p(\theta | \alpha) \left(\prod_{l=1}^K \sum_{t_l} p(t_l | \theta) p(w_l | t_l)\; \right) d\theta \]
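The joint distribution above describes a generative story for one document: draw the topic mix \(\theta\) from the Dirichlet, then for each word position draw a topic from \(\theta\) and a word from that topic's word distribution. The following simulation is only a sketch under stated assumptions: the vocabulary, the number of words n_words, the value of \(\alpha\), and the word-topic matrix B (taken here to have one row per topic, each row summing to one) are all made up for illustration.

```r
set.seed(42)

K     <- 3                                                   # number of topics
vocab <- c("bank", "rate", "goal", "team", "vote", "law")    # hypothetical vocabulary
alpha <- rep(0.5, K)                                         # hypothetical Dirichlet parameter

# Hypothetical word-topic matrix: row l holds p(w | t_l); each row sums to one
B <- rbind(c(0.450, 0.450, 0.025, 0.025, 0.025, 0.025),
           c(0.025, 0.025, 0.450, 0.450, 0.025, 0.025),
           c(0.025, 0.025, 0.025, 0.025, 0.450, 0.450))

# Step 1: theta ~ Dirichlet(alpha), sampled as normalized Gamma draws
g     <- rgamma(K, shape = alpha, rate = 1)
theta <- g / sum(g)

# Step 2: one topic per word position, drawn with probabilities theta
n_words <- 20
topics  <- sample(K, n_words, replace = TRUE, prob = theta)

# Step 3: one word per position, drawn from the row of B for its topic
words <- vapply(topics,
                function(l) sample(vocab, 1, prob = B[l, ]),
                character(1))

print(round(theta, 3))
print(table(topics))
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixes and word distributions are inferred from them.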

Likelihood of the entire Corpus

\[ p(D) = \prod_{j=1}^M \int p(\theta_j | \alpha) \left(\prod_{l=1}^K \sum_{t_{jl}} p(t_l | \theta_j) p(w_l | t_l)\; \right) d\theta_j \]

Examples in Finance

Word Embeddings with text2vec

See the original vignette from which this is abstracted. https://cran.r-project.org/web/packages/text2vec/vignettes/text-vectorization.html

library(text2vec)
## Warning: package 'text2vec' was built under R version 3.2.5
## 
## Attaching package: 'text2vec'
## The following object is masked from 'package:qdap':
## 
##     %>%

How to process data quickly using text2vec

Read in the provided data.

library(data.table)
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, last
## The following object is masked from 'package:qdapTools':
## 
##     shift
data("movie_review")
setDT(movie_review)
setkey(movie_review, id)
set.seed(2016L)
all_ids = movie_review$id
train_ids = sample(all_ids, 4000)
test_ids = setdiff(all_ids, train_ids)
train = movie_review[J(train_ids)]
test = movie_review[J(test_ids)]

print(head(train))
##          id sentiment
## 1:  11912_2         0
## 2: 11507_10         1
## 3:   8194_9         1
## 4: 11426_10         1
## 5:   4043_3         0
## 6:  11287_3         0
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    review
## 1:                                                                                                                                           The story behind this movie is very interesting, and in general the plot is not so bad... but the details: writing, directing, continuity, pacing, action sequences, stunts, and use of CG all cheapen and spoil the film.<br /><br />First off, action sequences. They are all quite unexciting. Most consist of someone standing up and getting shot, making no attempt to run, fight, dodge, or whatever, even though they have all the time in the world. The sequences just seem bland for something made in 2004.<br /><br />The CG features very nicely rendered and animated effects, but they come off looking cheap because of how they are used.<br /><br />Pacing: everything happens too quickly. For example, \\"Elle\\" is trained to fight in a couple of hours, and from the start can do back-flips, etc. Why is she so acrobatic? None of this is explained in the movie. As Lilith, she wouldn't have needed to be able to do back flips - maybe she couldn't, since she had wings.<br /><br />Also, we have sequences like a woman getting run over by a car, and getting up and just wandering off into a deserted room with a sink and mirror, and then stabbing herself in the throat, all for no apparent reason, and without any of the spectators really caring that she just got hit by a car (and then felt the secondary effects of another, exploding car)... \\"Are you okay?\\" asks the driver \\"yes, I'm fine\\" she says, bloodied and disheveled.<br /><br />I watched it all, though, because the introduction promised me that it would be interesting... but in the end, the poor execution made me wish for anything else: Blade, Vampire Hunter D, even that movie with vampires where Jackie Chan was comic relief, because they managed to suspend my disbelief, but this just made me want to shake the director awake, and give the writer a good talking to.
## 2:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I remember the original series vividly mostly due to it's unique blend of wry humor and macabre subject matter. Kolchak was hard-bitten newsman from the Ben Hecht school of big-city reporting, and his gritty determination and wise-ass demeanor made even the most mundane episode eminently watchable. My personal fave was \\"The Spanish Moss Murders\\" due to it's totally original storyline. A poor,troubled Cajun youth from Louisiana bayou country, takes part in a sleep research experiment, for the purpose of dream analysis. Something goes inexplicably wrong, and he literally dreams to life a swamp creature inhabiting the dark folk tales of his youth. This malevolent manifestation seeks out all persons who have wronged the dreamer in his conscious state, and brutally suffocates them to death. Kolchak investigates and uncovers this horrible truth, much to the chagrin of police captain Joe \\"Mad Dog\\" Siska(wonderfully essayed by a grumpy Keenan Wynn)and the head sleep researcher played by Second City improv founder, Severn Darden, to droll, understated perfection. The wickedly funny, harrowing finale takes place in the Chicago sewer system, and is a series highlight. Kolchak never got any better. Timeless.
## 3:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Despite the other comments listed here, this is probably the best Dirty Harry movie made; a film that reflects -- for better or worse -- the country's socio-political feelings during the Reagan glory years of the early '80's. It's also a kickass action movie.<br /><br />Opening with a liberal, female judge overturning a murder case due to lack of tangible evidence and then going straight into the coffee shop encounter with several unfortunate hoodlums (the scene which prompts the famous, \\"Go ahead, make my day\\" line), \\"Sudden Impact\\" is one non-stop roller coaster of an action film. The first time you get to catch your breath is when the troublesome Inspector Callahan is sent away to a nearby city to investigate the background of a murdered hood. It gets only better from there with an over-the-top group of grotesque thugs for Callahan to deal with along with a sherriff with a mysterious past. Superb direction and photography and a at-times hilarious script help make this film one of the best of the '80's.
## 4:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I think this movie would be more enjoyable if everyone thought of it as a picture of colonial Africa in the 50's and 60's rather than as a story. Because there is no real story here. Just one vignette on top of another like little points of light that don't mean much until you have enough to paint a picture. The first time I saw Chocolat I didn't really \\"get it\\" until having thought about it for a few days. Then I realized there were lots of things to \\"get\\", including the end of colonialism which was but around the corner, just no plot. Anyway, it's one of my all-time favorite movies. The scene at the airport with the brief shower and beautiful music was sheer poetry. If you like \\"exciting\\" movies, don't watch this--you'll be bored to tears. But, for some of you..., you can thank me later for recommending it to you.
## 5:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    The film begins with promise, but lingers too long in a sepia world of distance and alienation. We are left hanging, but with nothing much else save languid shots of grave and pensive male faces to savour. Certainly no rope up the wall to help us climb over. It's a shame, because the concept is not without merit.<br /><br />We are left wondering why a loving couple - a father and son no less - should be so estranged from the real world that their own world is preferable when claustrophobic beyond all imagining. This loss of presence in the real world is, rather too obviously and unnecessarily, contrasted with the son having enlisted in the armed forces. Why not the circus, so we can at least appreciate some colour? We are left with a gnawing sense of loss, but sadly no enlightenment, which is bewildering given the film is apparently about some form of attainment not available to us all.
## 6: This is a film that had a lot to live down to . on the year of its release legendary film critic Barry Norman considered it the worst film of the year and I'd heard nothing but bad things about it especially a plot that was criticised for being too complicated <br /><br />To be honest the plot is something of a red herring and the film suffers even more when the word \\" plot \\" is used because as far as I can see there is no plot as such . There's something involving Russian gangsters , a character called Pete Thompson who's trying to get his wife Sarah pregnant , and an Irish bloke called Sean . How they all fit into something called a \\" plot \\" I'm not sure . It's difficult to explain the plots of Guy Ritchie films but if you watch any of his films I'm sure we can all agree that they all posses one no matter how complicated they may seem on first viewing . Likewise a James Bond film though the plots are stretched out with action scenes . You will have a serious problem believing RANCID ALUMINIUM has any type of central plot that can be cogently explained <br /><br />Taking a look at the cast list will ring enough warning bells as to what sort of film you'll be watching . Sadie Frost has appeared in some of the worst British films made in the last 15 years and she's doing nothing to become inconsistent . Steven Berkoff gives acting a bad name ( and he plays a character called Kant which sums up the wit of this movie ) while one of the supporting characters is played by a TV presenter presumably because no serious actress would be seen dead in this <br /><br />The only good thing I can say about this movie is that it's utterly forgettable . I saw it a few days ago and immediately after watching I was going to write a very long a critical review warning people what they are letting themselves in for by watching , but by now I've mainly forgotten why . But this doesn't alter the fact that I remember disliking this piece of crap immensely

The processing steps are:

  1. Lower case the documents and then tokenize them.
  2. Create an iterator. (Step 1 can also be done while making the iterator, since the itoken function accepts a preprocessor and a tokenizer; see below.)
  3. Use the iterator to create the vocabulary, which is nothing but the list of unique words across all documents.
  4. Vectorize the vocabulary, i.e., create a data structure of words that can be used later for matrix factorizations needed for various text analytics.
  5. Using the iterator and vectorized vocabulary, form text matrices, such as the Document-Term Matrix (DTM) or the Term Co-occurrence Matrix (TCM).
  6. Use the TCM or DTM to undertake various text analytics such as classification, word2vec, topic modeling using LDA (Latent Dirichlet Allocation), and LSA (Latent Semantic Analysis).
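
To make steps 1 through 5 concrete, here is a minimal base-R sketch on a three-document toy corpus. It does not use text2vec; the documents and the names docs_toy, tokens_toy, vocab_toy and dtm_toy are made up for illustration only.

```r
# Toy corpus (hypothetical documents, not the movie_review data)
docs_toy = c("The film was good", "The plot was bad", "Good film good cast")

# Step 1: lower-case the documents and tokenize on whitespace
tokens_toy = strsplit(tolower(docs_toy), "\\s+")

# Step 3: the vocabulary is the list of unique words across all documents
vocab_toy = sort(unique(unlist(tokens_toy)))

# Steps 4-5: one row per document, one column per vocabulary word
dtm_toy = matrix(0L, nrow = length(tokens_toy), ncol = length(vocab_toy),
                 dimnames = list(paste0("doc", seq_along(tokens_toy)), vocab_toy))
for (i in seq_along(tokens_toy)) {
  counts = table(tokens_toy[[i]])
  dtm_toy[i, names(counts)] = as.integer(counts)
}

print(dtm_toy)
```

text2vec follows the same logic but stores the result as a sparse matrix, which matters once the vocabulary runs to tens of thousands of terms.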

Define preprocessing function and tokenization function

prep_fun = tolower
tok_fun = word_tokenizer

#Create an iterator to pass to the create_vocabulary function
it_train = itoken(train$review, 
             preprocessor = prep_fun, 
             tokenizer = tok_fun, 
             ids = train$id, 
             progressbar = FALSE)

#Now create a vocabulary
vocab = create_vocabulary(it_train)
print(vocab)
## Number of docs: 4000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 1 
## Vocabulary: 
##                 terms terms_counts doc_counts
##     1:     overturned            1          1
##     2: disintegration            1          1
##     3:         vachon            1          1
##     4:     interfered            1          1
##     5:      michonoku            1          1
##    ---                                       
## 35592:        penises            2          2
## 35593:        arabian            1          1
## 35594:       personal          102         94
## 35595:            end          921        743
## 35596:        address           10         10

What is an iterator?

An iterator is an object that traverses a container. A list is iterable. See: https://www.r-bloggers.com/iterators-in-r/
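
As a rough illustration of the idea, the sketch below builds an iterator by hand using a closure. The helper make_iterator and its has_next/next_val methods are hypothetical names chosen here; this is not how itoken is implemented, only a picture of what "traversing a container" means.

```r
# Hypothetical helper (not part of text2vec): a closure-based iterator
make_iterator = function(x) {
  i = 0                                   # current position, kept in the closure
  list(
    has_next = function() i < length(x),  # TRUE while elements remain
    next_val = function() {
      i <<- i + 1                         # advance the position
      x[[i]]                              # return the element at that position
    }
  )
}

it_toy = make_iterator(list("first review", "second review", "third review"))
seen_toy = character(0)
while (it_toy$has_next()) {
  seen_toy = c(seen_toy, it_toy$next_val())
}
print(seen_toy)
```

The consumer never indexes the container directly; it only asks for the next element, which is what lets functions such as create_vocabulary process documents one chunk at a time.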

Now vectorize

vectorizer = vocab_vectorizer(vocab)

Create the Document Term Matrix (DTM)

dtm_train = create_dtm(it_train, vectorizer)
print(dim(dtm_train))  #dim() works on the sparse matrix; no dense conversion needed
## [1]  4000 35596

Classify using the sentiment variable

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.2.4
## Loading required package: foreach
## Loaded glmnet 2.0-5
NFOLDS = 4
res = cv.glmnet(x = dtm_train, y = train[['sentiment']], 
                              family = 'binomial', 
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)
plot(res)

Use the fitted model to predict on the test data set.

it_test = test$review %>% prep_fun %>% tok_fun %>%   
  itoken(ids = test$id, progressbar = FALSE)

dtm_test = create_dtm(it_test, vectorizer)

preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.916697
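
The AUC can be read as the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one. The sketch below computes it from ranks in base R; auc_rank, labels_toy and scores_toy are hypothetical names and made-up values, and this is not claimed to be the routine glmnet uses internally.

```r
# Rank-based (Mann-Whitney) AUC; labels are 0/1, higher score = more positive
auc_rank = function(labels, scores) {
  r = rank(scores)                 # average ranks, so ties are handled
  n_pos = sum(labels == 1)
  n_neg = sum(labels == 0)
  # Sum of positive ranks, minus its minimum possible value, scaled to [0, 1]
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels_toy = c(1, 1, 0, 1, 0, 0)
scores_toy = c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)
auc_toy = auc_rank(labels_toy, scores_toy)
print(auc_toy)   # 8 of the 9 positive-negative pairs are ordered correctly
```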

N-Grams

n-grams are phrases formed from n consecutive words. For example, a bi-gram is a sequence of two consecutive words.

vocab = create_vocabulary(it_train, ngram = c(1, 2))
print(vocab)
## Number of docs: 4000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 2 
## Vocabulary: 
##                        terms terms_counts doc_counts
##      1: bad_characterization            1          1
##      2:             few_step            1          1
##      3:            also_took            1          1
##      4:          in_graphics            1          1
##      5:            like_poke            1          1
##     ---                                             
## 397499:       original_uncut            1          1
## 397500:           settle_his            2          2
## 397501:          first_blood            2          1
## 397502:        occasional_at            1          1
## 397503:         the_brothers           14         14

This creates a vocabulary of both single words and bi-grams. Notice how large it is compared to the unigram vocabulary from earlier. Because of this, we prune the vocabulary first, as this will speed up computation.
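
The following base-R sketch shows, on a toy corpus, how bi-grams enlarge the vocabulary and how count-based pruning shrinks it again. The helper uni_bigrams, the three documents, and the thresholds (a minimum term count of 2 and a maximum document proportion of 0.7) are illustrative assumptions, chosen to mimic the role of term_count_min and doc_proportion_max.

```r
# Hypothetical helper: unigrams plus bi-grams joined with "_"
uni_bigrams = function(tk) {
  if (length(tk) < 2) return(tk)
  c(tk, paste(head(tk, -1), tail(tk, -1), sep = "_"))
}

docs_toy = c("the film was good", "the film was bad", "the cast was good")
terms_by_doc = lapply(strsplit(docs_toy, " "), uni_bigrams)

# Total occurrences of each term, and number of documents containing it
term_counts = table(unlist(terms_by_doc))
doc_counts  = table(unlist(lapply(terms_by_doc, unique)))
doc_prop    = as.vector(doc_counts[names(term_counts)]) / length(docs_toy)

# Keep terms that are neither too rare nor present in almost every document
keep = as.vector(term_counts) >= 2 & doc_prop <= 0.7
pruned_toy = names(term_counts)[keep]

print(length(term_counts))   # vocabulary size before pruning
print(pruned_toy)            # terms that survive pruning
```

Here six unigrams grow to twelve terms once bi-grams are added, and pruning drops both the one-off terms and the words that appear in every document.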

Redo classification with n-grams.

vocab = vocab %>% prune_vocabulary(term_count_min = 10, 
                   doc_proportion_max = 0.5)
print(vocab)
## Number of docs: 4000 
## 0 stopwords:  ... 
## ngram_min = 1; ngram_max = 2 
## Vocabulary: 
##               terms terms_counts doc_counts
##     1:      morvern           14          1
##     2:   race_films           10          1
##     3:        bazza           11          1
##     4: thunderbirds           10          1
##     5:     mary_lou           21          1
##    ---                                     
## 17866:      br_also           36         36
## 17867:     a_better           96         89
## 17868:     tourists           10         10
## 17869:      in_each           14         14
## 17870: the_brothers           14         14
bigram_vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, bigram_vectorizer)
res = cv.glmnet(x = dtm_train, y = train[['sentiment']], 
                 family = 'binomial', 
                 alpha = 1,
                 type.measure = "auc",
                 nfolds = NFOLDS,
                 thresh = 1e-3,
                 maxit = 1e3)
plot(res)

print(names(res))
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"
#AUC (area under curve)
print(max(res$cvm))
## [1] 0.9217034

Out-of-sample test

dtm_test = create_dtm(it_test, bigram_vectorizer)
preds = predict(res, dtm_test, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.9268974

TF-IDF

We have seen the TF-IDF discussion earlier, and here we see how to implement it using the text2vec package.

vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)

tfidf = TfIdf$new()
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
dtm_test_tfidf  = create_dtm(it_test, vectorizer) %>% transform(tfidf)
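
To see what a TF-IDF adjustment does numerically, here is a base-R sketch using one common textbook variant: term frequency as a within-document proportion, and inverse document frequency as log(N / df). The matrix dtm_toy is made up, and text2vec's TfIdf may use different smoothing or normalization defaults, so treat this as an illustration of the idea rather than a replica of the package.

```r
# Toy DTM: rows are documents, columns are terms (made-up counts)
dtm_toy = matrix(c(2, 1, 0,
                   1, 0, 1,
                   1, 0, 0), nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("film", "good", "bad")))

tf_toy  = dtm_toy / rowSums(dtm_toy)       # term frequency within each document
df_toy  = colSums(dtm_toy > 0)             # number of documents containing the term
idf_toy = log(nrow(dtm_toy) / df_toy)      # rarer terms get a larger weight
tfidf_toy = sweep(tf_toy, 2, idf_toy, `*`) # scale each column by its idf

print(round(tfidf_toy, 3))
```

The word "film" appears in every document, so its weight is zero everywhere, while "good" and "bad" each appear in a single document and keep a positive weight.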

Now we take the TF-IDF adjusted DTM and run the classifier.

Refit classifier

res = cv.glmnet(x = dtm_train_tfidf, y = train[['sentiment']], 
                              family = 'binomial', 
                              alpha = 1,
                              type.measure = "auc",
                              nfolds = NFOLDS,
                              thresh = 1e-3,
                              maxit = 1e3)
print(paste("max AUC =", round(max(res$cvm), 4)))
## [1] "max AUC = 0.913"
#Test on hold-out sample
preds = predict(res, dtm_test_tfidf, type = 'response')[,1]
glmnet:::auc(test$sentiment, preds)
## [1] 0.8994684

Word Embeddings (word2vec)

From: http://stackoverflow.com/questions/39514941/preparing-word-embeddings-in-text2vec-r-package

Create the TCM (Term Co-occurrence Matrix) from scratch:

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:qdap':
## 
##     %>%
library(text2vec)
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it) %>% prune_vocabulary(term_count_min=10)
vectorizer = vocab_vectorizer(v, grow_dtm = FALSE, skip_grams_window = 5)
tcm = create_tcm(it, vectorizer)
print(dim(tcm))
## [1] 7797 7797
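
A term co-occurrence matrix counts how often two words appear within a given window of each other. The base-R sketch below shows the counting on a single toy token sequence; count_cooccurrence and the window of 2 are assumptions for illustration, and the sketch ignores refinements such as distance weighting that a package implementation may apply.

```r
# Hypothetical helper: symmetric co-occurrence counts within a window
count_cooccurrence = function(tk, window = 2) {
  vocab = sort(unique(tk))
  tcm = matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  n = length(tk)
  for (i in seq_len(n)) {
    j_max = min(n, i + window)            # look ahead at most `window` tokens
    if (i < j_max) for (j in (i + 1):j_max) {
      tcm[tk[i], tk[j]] = tcm[tk[i], tk[j]] + 1
    }
  }
  tcm + t(tcm)                            # make the counts symmetric
}

tk_toy = c("the", "film", "was", "good", "the", "cast", "was", "good")
tcm_toy = count_cooccurrence(tk_toy, window = 2)
print(tcm_toy)
```

The result is a square matrix with one row and one column per vocabulary word, which is why the dimensions printed above are equal.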

Now fit the word embeddings using GloVe. See: http://nlp.stanford.edu/projects/glove/

model = GlobalVectors$new(word_vectors_size=50, vocabulary=v, 
                          x_max=10, learning_rate=0.20)
model$fit(tcm,n_iter=25)
## 2016-12-11 10:13:29 - epoch 1, expected cost 0.0822
## 2016-12-11 10:13:30 - epoch 2, expected cost 0.0504
## 2016-12-11 10:13:31 - epoch 3, expected cost 0.0431
## 2016-12-11 10:13:31 - epoch 4, expected cost 0.0388
## 2016-12-11 10:13:32 - epoch 5, expected cost 0.0359
## 2016-12-11 10:13:33 - epoch 6, expected cost 0.0336
## 2016-12-11 10:13:33 - epoch 7, expected cost 0.0320
## 2016-12-11 10:13:34 - epoch 8, expected cost 0.0306
## 2016-12-11 10:13:34 - epoch 9, expected cost 0.0297
## 2016-12-11 10:13:35 - epoch 10, expected cost 0.0287
## 2016-12-11 10:13:36 - epoch 11, expected cost 0.0280
## 2016-12-11 10:13:36 - epoch 12, expected cost 0.0274
## 2016-12-11 10:13:37 - epoch 13, expected cost 0.0269
## 2016-12-11 10:13:37 - epoch 14, expected cost 0.0264
## 2016-12-11 10:13:38 - epoch 15, expected cost 0.0260
## 2016-12-11 10:13:39 - epoch 16, expected cost 0.0256
## 2016-12-11 10:13:39 - epoch 17, expected cost 0.0253
## 2016-12-11 10:13:40 - epoch 18, expected cost 0.0250
## 2016-12-11 10:13:41 - epoch 19, expected cost 0.0248
## 2016-12-11 10:13:41 - epoch 20, expected cost 0.0246
## 2016-12-11 10:13:42 - epoch 21, expected cost 0.0244
## 2016-12-11 10:13:42 - epoch 22, expected cost 0.0241
## 2016-12-11 10:13:43 - epoch 23, expected cost 0.0240
## 2016-12-11 10:13:44 - epoch 24, expected cost 0.0238
## 2016-12-11 10:13:44 - epoch 25, expected cost 0.0237
wv = model$get_word_vectors()  #Dimension words x wvec_size

Get the distance between words (or find close words)

#Make distance matrix
d = dist2(wv, method="cosine")  #Smaller values mean closer
print(dim(d))
## [1] 7797 7797
#Pass: w=word, d=dist matrix, n=number of close words
findCloseWords = function(w,d,n) {
  words = rownames(d)
  i = which(words==w)
  if (length(i) > 0) {
    res = sort(d[i,])
    print(as.matrix(res[2:(n+1)]))
  } 
  else {
    print("Word not in corpus.")
  }
}

Example: Show the ten words closest to the words “man” and “woman”. (The function prints its result itself, so it is called without an outer print, which would otherwise show each table twice.)

findCloseWords("man",d,10)
##             [,1]
## woman  0.1307530
## young  0.2417085
## who    0.2716574
## girl   0.2761752
## guy    0.3217673
## person 0.3422519
## boy    0.3628652
## plays  0.3815644
## kid    0.4020192
## a      0.4031629
findCloseWords("woman",d,10)
##            [,1]
## man   0.1307530
## young 0.1868513
## girl  0.2402866
## guy   0.3020979
## who   0.3086067
## boy   0.3364845
## named 0.3558772
## plays 0.3849196
## old   0.3954155
## lady  0.3958985

This is a very useful feature of word embeddings. It is often argued that words that are close to each other in the embedded space also tend to be semantically similar, even though the closeness is computed simply from their co-occurrence frequencies.
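
For intuition about the distances above, the sketch below computes cosine distance by hand in base R, assuming the usual definition of one minus the cosine similarity. The function cosine_dist and the three-dimensional vectors in wv_toy are made up for illustration; they are not the fitted GloVe vectors.

```r
# Cosine distance between all pairs of rows of a matrix
cosine_dist = function(m) {
  unit = m / sqrt(rowSums(m^2))   # scale each row to unit length
  1 - unit %*% t(unit)            # 1 - cosine similarity
}

# Made-up word vectors: "man" and "woman" point in similar directions
wv_toy = rbind(man   = c(1, 2, 0),
               woman = c(1, 2, 0.5),
               film  = c(0, 0, 3))
d_toy = cosine_dist(wv_toy)
print(round(d_toy, 4))
```

Each word is at distance zero from itself, which is why findCloseWords skips the first element of the sorted distances and reports elements 2 through n+1.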

word2vec (explained)

For more details, see: https://www.quora.com/How-does-word2vec-work

A geometrical interpretation: word2vec is a shallow word embedding model. This means that the model learns to map each discrete word id (0 through the number of words in the vocabulary) into a low-dimensional continuous vector-space from their distributional properties observed in some raw text corpus. Geometrically, one may interpret these vectors as tracing out points on the outside surface of a manifold in the “embedded space”. If we initialize these vectors from a spherical gaussian distribution, then you can imagine this manifold to look something like a hypersphere initially.

Let us focus on the CBOW for now. CBOW is trained to predict the target word t from the contextual words that surround it, c, i.e. the goal is to maximize P(t | c) over the training set. I am simplifying somewhat, but you can show that this probability is roughly inversely proportional to the distance between the current vectors assigned to t and to c. Since this model is trained in an online setting (one example at a time), at time T the goal is therefore to take a small step (mediated by the “learning rate”) in order to minimize the distance between the current vectors for t and c (and thereby increase the probability P(t |c)). By repeating this process over the entire training set, we have that vectors for words that habitually co-occur tend to be nudged closer together, and by gradually lowering the learning rate, this process converges towards some final state of the vectors.

By the Distributional Hypothesis (Firth, 1957; see also the Wikipedia page on Distributional semantics), words with similar distributional properties (i.e. that co-occur regularly) tend to share some aspect of semantic meaning. For example, we may find several sentences in the training set such as “citizens of X protested today” where X (the target word t) may be names of cities or countries that are semantically related.

You can therefore interpret each training step as deforming or morphing the initial manifold by nudging the vectors for some words somewhat closer together, and the result, after projecting down to two dimensions, is the familiar t-SNE visualizations where related words cluster together (e.g. Word representations for NLP).

For the skipgram, the direction of the prediction is simply inverted, i.e. now we try to predict P(citizens | X), P(of | X), etc. This turns out to learn finer-grained vectors when one trains over more data. The main reason is that the CBOW smooths over a lot of the distributional statistics by averaging over all context words while the skipgram does not. With little data, this “regularizing” effect of the CBOW turns out to be helpful, but since data is the ultimate regularizer the skipgram is able to extract more information when more data is available.

There’s a bit more going on behind the scenes, but hopefully this helps to give a useful geometrical intuition as to how these models work.
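
The nudging picture can be mimicked with a few lines of base R. This is only a geometric caricature under strong assumptions: the target vector is moved a fraction lr of the way toward the average context vector at each step, using Euclidean distance. It is not the actual word2vec gradient update, and all names and numbers here are made up.

```r
# Made-up two-dimensional vectors for a target word and two context words
target_toy  = c(1, 0)
context_toy = rbind(c(0, 1), c(0.2, 0.8))
ctx_mean    = colMeans(context_toy)   # CBOW-style averaging of the context

lr = 0.1                              # learning rate: size of each small step
dist_before = sqrt(sum((target_toy - ctx_mean)^2))

# Repeatedly nudge the target toward the averaged context
for (step in 1:20) {
  target_toy = target_toy + lr * (ctx_mean - target_toy)
}
dist_after = sqrt(sum((target_toy - ctx_mean)^2))

print(c(before = dist_before, after = dist_after))
```

Each step shrinks the remaining distance by the factor (1 - lr), so words that keep appearing in the same contexts are drawn steadily together.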

Topic Analysis

Uses Latent Dirichlet Allocation.

library(tm)
library(text2vec)
stopw = stopwords('en')
stopw = c(stopw,"br","t","s","m","ve","2","d","1")

#Make DTM
data("movie_review")
tokens = movie_review$review %>% tolower %>% word_tokenizer()
it = itoken(tokens)
v = create_vocabulary(it, stopwords = stopw) %>%
  prune_vocabulary(term_count_min=5)
vectrzr = vocab_vectorizer(v, grow_dtm = TRUE, skip_grams_window = 5)
dtm = create_dtm(it, vectrzr)
print(dim(dtm))
## [1]  5000 12733
#Do LDA
lda = LatentDirichletAllocation$new(n_topics=5, v)
lda$fit(dtm,n_iter = 25)
doc_topics = lda$fit_transform(dtm,n_iter = 25)
print(dim(doc_topics))
## [1] 5000    5
#Get word vectors by topic
topic_wv = lda$get_word_vectors()
print(dim(topic_wv))
## [1] 12733     5
#Plot LDA
library(LDAvis)
lda$plot()
## Loading required namespace: servr

This produces a terrific interactive plot.

Latent Semantic Analysis (LSA)

lsa = LatentSemanticAnalysis$new(n_topics = 5)
res = lsa$fit_transform(dtm)
print(dim(res))
## [1] 5000    5
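
LSA is usually described as a truncated singular value decomposition of the document-term matrix. The base-R sketch below shows that idea on a made-up 4 x 4 matrix with two obvious "topics"; the names and counts are illustrative, and the package may use a different solver or scaling, so this is a sketch of the technique rather than of the text2vec internals.

```r
# Toy DTM: docs 1-2 use the first two terms, docs 3-4 use the last two
lsa_dtm_toy = matrix(c(2, 1, 0, 0,
                       1, 2, 0, 0,
                       0, 0, 1, 2,
                       0, 0, 2, 1), nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:4),
                                     c("film", "cast", "stock", "bond")))

k_toy = 2                                   # number of latent dimensions to keep
s_toy = svd(lsa_dtm_toy)

# Document coordinates in the k-dimensional latent space
doc_embed_toy = s_toy$u[, 1:k_toy] %*% diag(s_toy$d[1:k_toy])

# Rank-k reconstruction of the original matrix
approx_toy = doc_embed_toy %*% t(s_toy$v[, 1:k_toy])

print(dim(doc_embed_toy))
print(round(approx_toy, 2))
```

As in the output above, the embedding has one row per document and one column per latent dimension; documents that share vocabulary end up with the same reconstructed profile.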

End Note!

Bibliography at: http://srdas.github.io/Das_TextAnalyticsInFinance.pdf